Data scientist Vinay Prabhu discovered that GPT-3, accessed through the Philosopher AI app, consistently generated racist and offensive content when prompted with certain queries, including ones about feminism, race theory, and countries such as Ethiopia.
In September 2020, data scientist Vinay Prabhu was experimenting with Philosopher AI, an app built on OpenAI's GPT-3 language model that lets users enter a prompt and receive an essay-length response. Prabhu discovered that certain types of prompts consistently returned offensive and racist content. When prompted about Ethiopia, for example, GPT-3 generated text claiming that Africa has 'had more than enough time to prove itself incapable of self-government' and made racist generalizations about Black populations. For such prompts, offensive output typically appeared within two or three attempts, and certain query types produced it with near-100% likelihood.

The incident raised broader safety concerns because OpenAI was simultaneously rolling out GPT-3 access to hundreds of developers through its API, for applications including customer service, tutoring, mental health apps, and video games. Philosopher AI's creator, Murat Ayfer, subsequently added content filters and reporting functions to the app.

The episode demonstrated the challenge of deploying large language models trained on internet data, which includes toxic content from platforms like Reddit, without adequate safeguards against harmful outputs.
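Philosopher AI's actual filter is not public, but the kind of post-hoc output filtering Ayfer added can be sketched roughly as below. Everything here is illustrative: `generate_text` is a hypothetical stand-in for the real GPT-3 API call, and the blocklist patterns and retry threshold are invented for the example; production systems would typically use a trained toxicity classifier rather than hand-written patterns.

```python
import re

# Hypothetical stand-in for the app's real model call (e.g., an HTTP request
# to the GPT-3 API). Not the actual Philosopher AI implementation.
def generate_text(prompt: str) -> str:
    return f"An essay-length response about {prompt}."

# Illustrative patterns only; a real filter would rely on a trained
# toxicity classifier rather than a hand-written blocklist.
BLOCKLIST = [
    r"incapable of self-government",  # phrase from the reported output
    r"\bsome_slur\b",                 # placeholder pattern
]

def is_flagged(text: str) -> bool:
    """Return True if the text matches any blocklisted pattern."""
    lowered = text.lower()
    return any(re.search(pattern, lowered) for pattern in BLOCKLIST)

def safe_generate(prompt: str, max_attempts: int = 3) -> str:
    """Regenerate on flagged output, refusing after max_attempts failures."""
    for _ in range(max_attempts):
        output = generate_text(prompt)
        if not is_flagged(output):
            return output
    return "This prompt tends to produce unsafe output and has been blocked."

if __name__ == "__main__":
    print(safe_generate("ethiopia"))
```

A retry-then-refuse design like this reflects Prabhu's observation that offensive output appeared within a few attempts: if a prompt keeps tripping the filter, the safer choice is to block it outright rather than keep sampling.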
The domain classification, causal taxonomy, severity scores, and national security assessments below were generated by an LLM classifier and may contain errors.
AI that exposes users to harmful, abusive, unsafe, or inappropriate content, possibly including advice on or encouragement of harmful actions. Examples of toxic content include hate speech, violence, extremism, illegal acts, and child sexual abuse material, as well as content that violates community norms, such as profanity, inflammatory political speech, or pornography.
AI system: Due to a decision or action made by an AI system.
Unintentional: Due to an unexpected outcome from pursuing a goal.
Post-deployment: Occurring after the AI model has been trained and deployed.