At the core of machine learning are training datasets. These collections, most commonly of images, carry labels (metadata) describing their contents, and an algorithm uses them to learn how to classify new examples. A portion of the dataset is reserved for validation: testing the learned model on new, previously unseen data. If all goes well, the model is then ready to classify entirely new data from the real world.
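The train/validation workflow described above can be sketched in a few lines of Python. The filenames, labels, and 80/20 split ratio here are illustrative assumptions, not details from any particular dataset:

```python
# A minimal sketch of the labeled-dataset and train/validation split
# described above. The data and the 80/20 ratio are illustrative only.
import random

def train_validation_split(examples, validation_fraction=0.2, seed=42):
    """Shuffle labeled examples and hold out a portion for validation."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cutoff], shuffled[cutoff:]

# Each example pairs an image (here just a filename) with a label (metadata).
dataset = [(f"image_{i:03}.jpg", "cat" if i % 2 else "dog") for i in range(100)]
train_set, validation_set = train_validation_split(dataset)
print(len(train_set), len(validation_set))  # 80 20
```

The model learns only from `train_set`; `validation_set` stays unseen until evaluation, which is why errors and biases baked into the labels propagate silently into the trained model.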
There are many such datasets and they are used repeatedly by AI researchers and developers to build their models.
And therein lies the problem.
Issues with datasets (e.g. lack of organizing principles, biased coding, poor metadata, and little or no quality control) are inherited by the models trained on them and surface in their behavior.
While over-reliance on common datasets has long been a concern (see Holte, “Very simple classification rules perform well on most commonly used datasets”, Machine Learning, 1993), the issue has received widespread attention because of the work of Kate Crawford and Trevor Paglen. Their report, Excavating AI: The Politics of Images in Machine Learning Training Sets, and their online demonstration tool, ImageNet Roulette (no longer available as of September 27th), identified extraordinary bias, misidentification, racism, and homophobia. Frankly, it will shock you.
Calling their work the “archeology of datasets”, Crawford and Paglen uncovered what is well known to the LIS field: all classification is political, social, and contextual. In essence, any classification system is wrong and biased even if it is useful (see Bowker & Star, Sorting Things Out, 1999).
From an LIS perspective, how is ImageNet constructed? What is the epistemological basis, the controlled taxonomy, and the subclasses? Who added the metadata, under what conditions, and with what training and oversight?
ImageNet was crowdsourced using Amazon’s Mechanical Turk. Once again, therein lies the problem.
While ImageNet did use the WordNet taxonomy to control classifications, it is not clear how effectively this was managed. The results uncovered by Crawford and Paglen suggest not very well. This year many training datasets were taken offline, and others were severely culled (ImageNet alone will remove 600,000 images). However, these datasets are important; ML relies on them.
Bottom line: the LIS field has extensive expertise and practical experience in creating and managing classification systems and the requisite metadata. We are good at this, we know the pitfalls, and there is a clear and compelling opportunity for LIS researchers and practitioners to be centrally involved in the creation of ML training datasets.