Researchers found that Google's Perspective API, an AI system designed to detect toxic comments, exhibited significant biases: it incorrectly flagged identity-related terms as toxic and failed to accurately assess incivility in news content.
Researchers from the University of Pennsylvania and other institutions conducted studies of Google's Perspective API, a machine learning system developed by Jigsaw to detect toxic comments online. The API was trained on over a million online comments and deployed by major news organizations including The New York Times, The Guardian, and The Economist.

Multiple independent studies revealed systematic biases in the system: it rated phrases like 'I am a gay black woman' as 87% toxic, while 'I am a man' received a much lower toxicity score. It also flagged identity descriptors such as 'Black,' 'Muslim,' 'feminist,' and 'gay' as toxic even when they were used in neutral contexts.

In testing on news transcripts from PBS NewsHour, MSNBC's Rachel Maddow Show, and Fox News' Hannity, the API failed to distinguish between the levels of incivility perceived by human annotators, agreeing with them only 51% of the time. It could not detect subtle forms of incivility such as sarcasm, while over-predicting toxicity for words commonly used in news reporting. Researchers also demonstrated that the system could be bypassed by simple misspellings: changing 'idiot' to 'idiiot' reduced the toxicity score from 84% to 20%.
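For reference, a minimal sketch of how such probes query the public Perspective API. The endpoint and response shape follow Jigsaw's published documentation; the API key placeholder is hypothetical, and current model versions may return different scores than those the studies reported.

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder; real keys are issued via Google Cloud
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def toxicity_score(text: str) -> float:
    """Return the Perspective API TOXICITY probability (0.0-1.0) for `text`."""
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(URL, json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# Probes mirroring the two failure modes described above:
# identity phrases inflating scores, and a misspelling deflating them.
for text in [
    "I am a gay black woman",
    "I am a man",
    "You are an idiot",
    "You are an idiiot",
]:
    print(f"{text!r}: {toxicity_score(text):.0%}")
```

Comparing the scores for the paired inputs is enough to reproduce both findings: the identity phrase scoring far higher than the neutral one, and the misspelled insult scoring far lower than the correctly spelled version.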
Domain classification, causal taxonomy, severity scores, and national security assessments were LLM-classified and may contain errors.
Unequal treatment of individuals or groups by AI, often based on race, gender, or other sensitive characteristics, resulting in unfair outcomes and unfair representation of those groups.
Entity: AI system (due to a decision or action made by an AI system)
Intent: Unintentional (due to an unexpected outcome from pursuing a goal)
Timing: Post-deployment (occurring after the AI model has been trained and deployed)