
Training & validation data

The Risks of Machine Learning Systems

Tan, Taeihagh & Baxter (2022)

Sub-category

"This is the risk posed by the choice of data used for training and validation."(p. 8)

Supporting Evidence (5)

1.
"Control over training and validation data: Using pretrained models (e.g., GPT-3 [27], BERT [52], Inception [174]) for processing unstructured data such as images and text is becoming increasingly common. While this can significantly improve performance, the trade-off is reduced control over the training data for teams that do not pretrain their own models and simply build on top of publicly released models or machine learning API services (e.g., translation). Given the discovery of systemic labeling errors, stereotypes, and even pornographic content in popular datasets such as ImageNet [16, 135, 187], it is important to consider the downstream ramifications of using models pretrained on these datasets. The studies mentioned above were performed on publicly available datasets; Birhane et al. further highlight the existence of pretrained models trained on private datasets that cannot be independently audited by researchers [16]."(p. 8)
2.
"Demographic representativeness: Due to the data-driven nature of machine learning, training an ML system on data that insufficiently represent underrepresented demographics may lead to disproportionate underperformance for these demographics during inference, especially if unaccounted for during model design. This is representativeness in the quantitative sense, of the “number of examples in the training/validation set”, and the performance disparity can result in allocational harms where the minority demographics have reduced access to resources due to the poorer performance. For example, poor automated speech recognition performance for minority dialect speakers (e.g., African American Vernacular English) will have devastating consequences in the courtroom. We may also think of representativeness in the qualitative sense, where stereotypical examples are avoided and fairer conceptions of these demographics are adopted. Since labels are often crowdsourced, there is the additional risk of bias being introduced via the annotators’ sociocultural backgrounds and desire to please."(p. 8)
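The allocational harm described above only becomes visible when evaluation metrics are disaggregated by demographic group rather than averaged over the whole test set. A minimal sketch of such a disaggregated check, using synthetic labels and an illustrative function name (not from the paper):

```python
from collections import defaultdict

def per_group_accuracy(y_true, y_pred, groups):
    """Accuracy broken down by demographic group.

    Aggregate accuracy can hide a group whose error rate is far worse
    than the majority's; disaggregating makes the disparity visible.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}

# Synthetic example: overall accuracy is 80%, but every error falls on
# the underrepresented group "b".
y_true = [1, 1, 0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
groups = ["a", "a", "a", "a", "a", "b", "b", "a", "a", "a"]

print(per_group_accuracy(y_true, y_pred, groups))
# → {'a': 1.0, 'b': 0.0}
```

In practice this kind of check is run per metric (word error rate for speech recognition, false positive rate for classifiers) and per protected attribute, with attention to groups that have few examples in the validation set.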
3.
"Similarity of data distributions: Where demographic representativeness deals with the proportion of subpopulations in the dataset, distributional similarity is more concerned with major shifts between training and deployment distributions. This can occur when there is no available training data matching a niche deployment setting and an approximation has to be used. However, this comes with the risk of domain mismatch and consequently, poorer performance. For example, an autonomous vehicle trained on data compiled in Sweden would not have been exposed to jumping kangaroos. Subsequently deploying the vehicle in Australia will result in increased safety risk from being unable to identify and avoid them, potentially increasing the chance of a crash."(p. 9)
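Domain mismatch of this kind can often be surfaced before deployment by comparing feature distributions between the training set and a sample from the target environment. A minimal stdlib-only sketch using a two-sample Kolmogorov-Smirnov statistic on a single scalar feature; the data is synthetic and the temperature framing is purely illustrative:

```python
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the empirical CDFs of the two samples (0 = identical distributions,
    values near 1 = nearly disjoint distributions)."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = sum(x <= v for x in a) / len(a)
        cdf_b = sum(x <= v for x in b) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

random.seed(0)
train = [random.gauss(5, 3) for _ in range(500)]    # e.g. a Swedish climate feature
same = [random.gauss(5, 3) for _ in range(500)]     # held-out data, same domain
deploy = [random.gauss(25, 4) for _ in range(500)]  # e.g. the Australian deployment

print(ks_statistic(train, same))    # small: distributions match
print(ks_statistic(train, deploy))  # near 1.0: clear domain shift
```

For real systems the same comparison would be run per feature (or on learned embeddings, for images), and libraries such as SciPy provide `ks_2samp` with an accompanying p-value; the point here is only that distributional mismatch is measurable before the vehicle meets its first kangaroo.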
4.
"Quality of data sources: The popular saying, “garbage in, garbage out”, succinctly captures the importance of data quality for ML systems. Common factors affecting the quality of labeled data include annotator expertise level, inter-annotator agreement, overlaps between validation and training/pretraining data. The recent trend towards training on increasingly large datasets scraped from the web makes manual data annotation infeasible due to the sheer scale. While such datasets satiate increasingly large and data-hungry neural networks, they often contain noisy labels, harmful stereotypes, and even pornographic content. Kreutzer et al. manually audited several multilingual web-crawled text datasets and found significant issues such as wrongly labeled languages, pornographic content, and non-linguistic content. An even greater concern from the ML perspective is the leakage of benchmark test data and machine-generated data (e.g., machine-translated text, GAN-generated images) into the training set. The former was only discovered after training GPT-3, while the latter is inevitable in uncurated web-crawled data due to its prevalence on the Internet. Researchers have also discovered bots completing data annotation tasks on Amazon Mechanical Turk, a platform used to collect human annotations for benchmark datasets. However, cleaning such datasets is no mean feat: blocklist-based methods for content filtering may erase reclaimed slurs, minority dialects, and other non-offensive content, inadvertently harming the minority communities they belong to. In fact, the very notion of cleaning language datasets may reinforce sociocultural biases and deserves further scrutiny."(p. 9)
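Two of the quality factors named above, inter-annotator agreement and train/validation overlap, are mechanically checkable. A minimal sketch with synthetic annotations: Cohen's kappa for chance-corrected agreement between two annotators, plus an exact-duplicate leakage check (real pipelines would also use fuzzy or n-gram matching, which this sketch omits):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for
    the agreement expected by chance (1 = perfect, 0 = chance level).
    Assumes the annotators did not assign only a single identical label."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

def leaked_examples(train_set, test_set):
    """Exact-match overlap between training and test data -- the most
    basic form of the benchmark-leakage check described above."""
    return set(train_set) & set(test_set)

ann_a = ["pos", "pos", "neg", "neg", "pos", "neg"]
ann_b = ["pos", "pos", "neg", "pos", "pos", "neg"]
print(round(cohens_kappa(ann_a, ann_b), 3))  # → 0.667

train = ["the cat sat", "a dog ran", "it rained today"]
test = ["it rained today", "snow fell hard"]
print(leaked_examples(train, test))  # → {'it rained today'}
```

Here the annotators agree on 5 of 6 items (83% raw agreement), but kappa is only 0.667 because their label frequencies make 50% agreement expected by chance alone; low kappa on a labeling task is a signal to revisit the annotation guidelines before training on the labels.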
5.
"Presence of personal information: The presence of personal information in the training data increases the risk of the ML model memorizing this information, as deep neural networks have been shown to do. This could lead to downstream consequences for privacy when membership inference attacks are used to extract such information. We discuss this in greater detail in Section 5.4."(p. 9)
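The mechanics of a membership inference attack can be illustrated with a common loss-threshold baseline: because overfit models assign unusually low loss to memorized training examples, an attacker who can query per-example losses simply guesses "member" below some threshold. The losses and threshold here are synthetic, a sketch of the attack's intuition rather than any specific attack from the paper:

```python
def infer_membership(losses, threshold):
    """Loss-threshold membership inference baseline: examples on which
    the model's loss is unusually low are guessed to have been part of
    the training set (i.e., memorized)."""
    return [loss < threshold for loss in losses]

# Synthetic per-example losses: training members tend to have lower
# loss because the model has (over)fit them.
member_losses = [0.02, 0.05, 0.01, 0.08]
nonmember_losses = [0.9, 1.4, 0.7, 1.1]

guesses = infer_membership(member_losses + nonmember_losses, threshold=0.5)
print(guesses)
# → [True, True, True, True, False, False, False, False]
```

The attack succeeds exactly to the degree that the member and non-member loss distributions separate, which is why reducing memorization (e.g., via regularization or differentially private training) also reduces the privacy risk the quote describes.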

Part of First-Order Risks
