Home / Series / Chaos Communication Congress / Aired Order / Season 39 / Episode 107

a media-almost-archaeology on data that is too dirty for "AI"

When datasets were scaled up to the volume of the (partial) internet, on the assumption that scale will average out the noise, large-dataset builders came up with a human-not-in-the-loop, cheaper-than-cheap-labor method to clean them: heuristic filtering. Heuristics in this context are basically a set of rules that engineers came up with, using their imagination and estimation, to work best for their own perspective on “cleaning”. Most datasets adopt heuristics from existing ones, then add some extra filtering rules for their own specific characteristics. I would like to invite you to taste together these silent, anonymous yet upheld estimations and not-guaranteed rationalities in current sociotechnical artifacts, and to ask for whom these estimations are good enough, as they will soon be part of our technological infrastructures.
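
To make "a set of rules" concrete, here is a minimal sketch of what such a heuristic filter might look like. It is not taken from any particular dataset: the thresholds, the blocklist, and the function names `keep_line` and `clean_document` are all hypothetical, chosen only to illustrate the kind of estimation involved.

```python
from typing import Optional

# Hypothetical thresholds and blocklist, picked for illustration only;
# they are not the rules of any real dataset.
MIN_WORDS_PER_LINE = 3
MIN_LINES_PER_DOC = 2
BLOCKLIST = ("lorem ipsum",)
TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def keep_line(line: str) -> bool:
    """Keep a line only if it is long enough and ends like a sentence."""
    stripped = line.strip()
    return (
        len(stripped.split()) >= MIN_WORDS_PER_LINE
        and stripped.endswith(TERMINAL_PUNCTUATION)
    )


def clean_document(text: str) -> Optional[str]:
    """Return the filtered text, or None if the whole document is dropped."""
    lowered = text.lower()
    # Rule 1: drop any document that contains a blocklisted phrase.
    if any(phrase in lowered for phrase in BLOCKLIST):
        return None
    # Rule 2: keep only lines that pass the per-line heuristic.
    kept = [line.strip() for line in text.splitlines() if keep_line(line)]
    # Rule 3: drop documents with too few surviving lines.
    if len(kept) < MIN_LINES_PER_DOC:
        return None
    return "\n".join(kept)
```

Under these assumed rules a navigation bar such as `Home | About | Login` is removed, but so is a short poem written without punctuation: `clean_document("roses are red\nviolets are blue")` returns `None`. Whether that counts as "clean" depends entirely on whose estimation went into the thresholds.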

English
  • Originally Aired December 29, 2025
  • Runtime 40 minutes
  • Production Code 1798
  • Created December 22, 2025 by r4m3u5
  • Modified December 29, 2025 by r4m3u5