Facebook's hate speech detection algorithms systematically failed to remove racist content targeting minorities while over-removing content critical of white people, despite internal research showing the worst hate speech was directed at Black, Muslim, LGBTQ, and Jewish users.
Between 2019 and 2021, Facebook conducted internal research, the "Worst of the Worst" project, that revealed serious flaws in its AI-powered hate speech detection algorithms. The company surveyed more than 10,000 users to identify the most harmful content and found that the worst examples were almost all directed at minority groups, particularly Black, Muslim, LGBTQ, and Jewish users. Facebook's algorithms, however, removed content critical of white people at much higher rates: approximately 90 percent of hate speech takedowns were for content directed at white people and men, while the algorithms consistently failed to remove the most derogatory racist content targeting minorities. Internal documents showed that only 3-5 percent of actual hate speech on the platform was being removed, despite public claims of removal rates above 90 percent.

When researchers proposed overhauling the system to better protect vulnerable minorities, Facebook executives, including VP Joel Kaplan, rejected the proposal, citing concerns about conservative backlash and the appearance of non-neutrality. The company implemented only minor changes, leaving minorities more exposed to hate speech. Internal studies showed Black users were leaving the platform, with usage declining 2.7 percent in one month to 17.3 million adults, partly due to harassment and inadequate content moderation.
Domain classification, causal taxonomy, severity scores, and national security assessments were LLM-classified and may contain errors.
The accuracy and effectiveness of AI decisions and actions depend on group membership: choices in AI system design and biased training data lead to unequal outcomes, reduced benefits, increased effort, and alienation of users.
Entity: AI system (due to a decision or action made by an AI system)
Intent: Unintentional (due to an unexpected outcome from pursuing a goal)
Timing: Post-deployment (occurring after the AI model has been trained and deployed)