Facebook's AI moderation tools were found to remove posts accounting for only 3-5% of hate speech views and 0.6% of violence and incitement views, falling far short of the company's public claims about AI effectiveness.
Facebook deployed AI-powered automated moderation tools to detect and remove hate speech and violent content from its platform. Internal documents from March revealed that these AI systems removed posts accounting for only 3-5% of views of hate speech and 0.6% of views of violence and incitement, despite CEO Mark Zuckerberg's 2018 claim that the company expected to train its systems to 'proactively detect the vast majority of problematic content' by the end of 2019. Senior engineers acknowledged in 2019 that the company might hit a ceiling beyond which further advances would be difficult, estimating that pushing detection beyond 10-20% would be very hard in the short-to-medium term. The AI tools also made significant errors, mistakenly flagging cockfighting videos as car crashes and mass-shooting videos as paintball games or car washes. The situation was worse in non-English-speaking countries: Facebook estimated it identified just 0.23% of hate speech in Afghanistan because it lacked dictionaries of slurs in local languages. Although user surveys indicated a preference for more aggressive enforcement even at the cost of false positives, Facebook leadership remained more concerned with avoiding over-removal, so engineers trained models to minimize false positives, letting more hate speech through undetected.
Domain classification, causal taxonomy, severity scores, and national security assessments were LLM-classified and may contain errors.
AI systems that fail to perform reliably or effectively under varying conditions, leaving them prone to errors and failures that can have significant consequences, especially in critical applications or domains that require moral reasoning.
AI system
Due to a decision or action made by an AI system
Unintentional
Due to an unexpected outcome from pursuing a goal
Post-deployment
Occurring after the AI model has been trained and deployed
No population impact data reported.