Lower performance for some languages and social groups
The accuracy and effectiveness of AI decisions and actions depend on group membership: choices made in AI system design and biased training data lead to unequal outcomes, reduced benefits, increased effort, and alienation for some groups of users.
"LMs are typically trained in few languages, and perform less well in other languages [95, 162]. In part, this is due to unavailability of training data: there are many widely spoken languages for which no systematic efforts have been made to create labelled training datasets, such as Javanese which is spoken by more than 80 million people [95]. Training data is particularly missing for languages that are spoken by groups who are multilingual and can use a technology in English, or for languages spoken by groups who are not the primary target demographic for new technologies."(p. 217)
Supporting Evidence (2)
"Training data can also be lacking when relatively little digitised text is available in a language, e.g. Seychellois Creole [95]. Disparate performance can also occur based on slang, dialect, sociolect, and other aspects that vary within a single language [23]. One reason for this is the underrepresentation of certain groups and languages in training corpora, which often disproportionately affects communities who are marginalised, excluded, or less fre- quently recorded, also referred to as the ”undersampled majority” [150]"(p. 217)
"In the case of LMs where great benefits are anticipated, lower performance for some groups risks creating a distribution of bene- fits and harms that perpetuates existing social inequities and raises social justice concerns [15, 79, 95].]"(p. 217)
Part of Risk area 1: Discrimination, Hate speech and Exclusion