Training Datasets, Classification, and the LIS Field

At the core of machine learning are training datasets. These collections, most commonly of images, carry labels (metadata) describing their contents, and an algorithm uses them to learn how to classify such content. A portion of the dataset is reserved for validation: testing the learned model with new, previously unseen data. If all goes well, the model is then ready to classify entirely new data from the real world.
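
For readers who have not seen this step in practice, here is a minimal sketch of reserving part of a labeled dataset for validation. The function name `holdout_split`, the 20% fraction, and the file names are illustrative assumptions, not taken from any particular dataset or library.

```python
import random


def holdout_split(examples, validation_fraction=0.2, seed=0):
    """Shuffle labeled examples and reserve a fraction for validation.

    Hypothetical helper for illustration; real projects usually rely on
    a library routine and often stratify the split by label.
    """
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(examples)
    # A fixed seed makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    # The first n_val shuffled items are held out; the rest are for training.
    return shuffled[n_val:], shuffled[:n_val]


# Made-up labeled dataset: (file name, label) pairs.
dataset = [(f"img_{i:03d}.jpg", "cat" if i % 2 == 0 else "dog") for i in range(10)]
train, validation = holdout_split(dataset)
print(len(train), len(validation))
```

Note that the validation items come from the same collection, with the same labels and the same labelers, as the training items. A held-out set can therefore confirm that a model learned the labels, but not that the labels were any good.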

There are many such datasets and they are used repeatedly by AI researchers and developers to build their models.

And therein lies the problem.

Issues with datasets (e.g. lack of organizing principles, biased coding, poor metadata, and little or no quality control) produce models that inherit those problems and reproduce them in operation.

While overreliance on common datasets has long been a concern (see Holte, “Very simple classification rules perform well on most commonly used datasets”, Machine Learning, 1993), the issue has received widespread attention because of the work of Kate Crawford and Trevor Paglen. Their report, Excavating AI: The Politics of Images in Machine Learning Training Sets, and their online demonstration tool, ImageNet Roulette (no longer available as of September 27th), identified extraordinary bias, misidentification, racism, and homophobia. Frankly, it will shock you.

Kate Crawford and Trevor Paglen (with their misidentified classifications from ImageNet Roulette)

Calling their work the “archeology of datasets”, Crawford and Paglen uncovered what is well known to the LIS field: all classification is political, social, and contextual. In essence, any classification system is wrong and biased even if it is useful (see Bowker & Star, Sorting Things Out, 1999).

From an LIS perspective, how is ImageNet constructed? What is the epistemological basis, the controlled taxonomy, and the subclasses? Who added the metadata, under what conditions, and with what training and oversight?

ImageNet was crowdsourced using Amazon’s Mechanical Turk. Once again, therein lies the problem.

While ImageNet did use the WordNet taxonomy to control classifications, it is not clear how effectively this was managed. The results uncovered by Crawford and Paglen suggest that it was not managed very effectively. This year many training datasets were taken offline or made unavailable, and many were severely culled (ImageNet will remove 600,000 images). However, these datasets are important; ML relies on them.
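
To make "quality control" concrete, here is a minimal sketch of the kind of audit a metadata librarian might run over crowdsourced labels: check each label against a controlled vocabulary and flag items where labelers disagree. This is not how ImageNet was managed. The function `audit_labels`, the 0.6 agreement threshold, the tiny vocabulary, and the sample data are all hypothetical.

```python
from collections import Counter


def audit_labels(annotations, vocabulary, min_agreement=0.6):
    """Flag items with off-vocabulary labels or weak labeler agreement.

    annotations: dict mapping an item id to the list of labels that
    different labelers assigned to it (hypothetical structure).
    vocabulary: set of terms permitted by the controlled vocabulary.
    """
    flagged = {}
    for item, labels in annotations.items():
        if not labels:
            # An unlabeled item always needs review.
            flagged[item] = {"off_vocabulary": [], "agreement": 0.0}
            continue
        off_vocab = sorted({label for label in labels if label not in vocabulary})
        # Agreement is the share of labelers who chose the most common label.
        _, top_count = Counter(labels).most_common(1)[0]
        agreement = top_count / len(labels)
        if off_vocab or agreement < min_agreement:
            flagged[item] = {
                "off_vocabulary": off_vocab,
                "agreement": round(agreement, 2),
            }
    return flagged


# Made-up controlled vocabulary and crowdsourced labels.
vocabulary = {"dog", "cat", "bicycle"}
annotations = {
    "img_001.jpg": ["dog", "dog", "dog"],
    "img_002.jpg": ["cat", "dog", "bicycle"],
    "img_003.jpg": ["cat", "cat", "kitty"],
}
flagged = audit_labels(annotations, vocabulary)
for item, report in flagged.items():
    print(item, report)
```

A check like this catches only mechanical problems. It cannot tell whether the vocabulary itself encodes bias, which is the harder question Crawford and Paglen raise.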

Bottom line: the LIS field has extensive expertise and practical experience in creating and managing classification systems and the requisite metadata. We are good at this, we know the pitfalls, and this is a clear and compelling opportunity for LIS researchers and practitioners to be centrally involved in the creation of ML training datasets.


One thought on “Training Datasets, Classification, and the LIS Field”

  1. Of course, as soon as you post, something else of interest comes to your attention:

    Anja Bechmann and Geoffrey C. Bowker, “Unsupervised by Any Other Name: Hidden Layers of Knowledge Production in Artificial Intelligence on Social Media,” Big Data & Society 6, no. 1 (January 2019): 205395171881956.

    Artificial Intelligence (AI) in the form of different machine learning models is applied to Big Data as a way to turn data into valuable knowledge. The rhetoric is that ensuing predictions work well—with a high degree of autonomy and automation. We argue that we need to analyze the process of applying machine learning in depth and highlight at what point human knowledge production takes place in seemingly autonomous work. This article reintroduces classification theory as an important framework for understanding such seemingly invisible knowledge production in the machine learning development and design processes. We suggest a framework for studying such classification closely tied to different steps in the work process and exemplify the framework on two experiments with machine learning applied to Facebook data from one of our labs. By doing so we demonstrate ways in which classification and potential discrimination take place in even seemingly unsupervised and autonomous models. Moving away from concepts of non-supervision and autonomy enable us to understand the underlying classificatory dispositifs in the work process and that this form of analysis constitutes a first step towards governance of artificial intelligence.
