Five major commercial automated speech recognition (ASR) systems from Amazon, Apple, Google, IBM, and Microsoft exhibited substantial racial disparities, with average word error rates of 35% for Black speakers compared to 19% for white speakers.
Stanford researchers analyzed five commercial automated speech recognition systems developed by Amazon, Apple, Google, IBM, and Microsoft, using audio from interviews with 42 white speakers and 73 Black speakers across five US cities. The systems were tested on 2,141 matched audio snippets from each group, totaling 19.8 hours of audio. All five ASR systems exhibited substantial racial disparities in performance, with an average word error rate of 35% for Black speakers compared to 19% for white speakers. Microsoft's system performed best overall while Apple's performed worst, but across all five systems error rates for Black speakers were nearly twice those for white speakers. The disparities were particularly pronounced for Black men, who had a 41% error rate compared to 30% for Black women. The study found that 23% of audio snippets from Black speakers resulted in unusable transcripts (error rate above 50%) compared to only 1.6% for white speakers. The researchers traced these disparities to acoustic models rather than language models, suggesting the systems struggle with phonological and prosodic characteristics of African American Vernacular English rather than with vocabulary or grammar differences.
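The study's headline numbers are word error rates (WER). As a minimal sketch of how WER is typically computed (word-level Levenshtein edit distance divided by reference length; this is the standard metric, not a reconstruction of the paper's exact evaluation pipeline):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for word-level Levenshtein distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)


# One substitution and one deletion against a 4-word reference -> WER 0.5,
# i.e. the 50% threshold the study used to flag a transcript as unusable.
print(word_error_rate("a b c d", "a x c"))  # → 0.5
```

A snippet whose WER exceeds 0.5 would fall into the "unusable transcript" category the study reports for 23% of Black speakers' audio.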
Domain classification, causal taxonomy, severity scores, and national security assessments were LLM-classified and may contain errors.
The accuracy and effectiveness of AI decisions and actions depend on group membership: design decisions in AI systems and biased training data lead to unequal outcomes, reduced benefits, increased effort, and alienation for affected groups of users.
AI system: due to a decision or action made by an AI system
Unintentional: due to an unexpected outcome from pursuing a goal
Post-deployment: occurring after the AI model has been trained and deployed