Data scientist Vinay Prabhu discovered that GPT-3, accessed through the Philosopher AI app, consistently generated racist and offensive content when prompted with certain queries, including ones about feminism, race theory, and countries such as Ethiopia.
In September 2020, data scientist Vinay Prabhu was experimenting with Philosopher AI, an app built on OpenAI's GPT-3 language model that lets users enter a prompt and receive an essay-length response. Prabhu discovered that certain types of prompts consistently returned offensive and racist content. When prompted about Ethiopia, for example, GPT-3 generated text claiming that Africa has 'had more than enough time to prove itself incapable of self-government' and made racist generalizations about Black populations. For such prompts, offensive output typically appeared within two or three attempts, and certain query types produced it with near-100% likelihood.

The incident raised broader safety concerns because OpenAI was simultaneously rolling out GPT-3 access to hundreds of developers through its API, for applications including customer service, tutoring, mental health apps, and video games. Philosopher AI's creator, Murat Ayfer, subsequently added content filters and reporting functions to the app.

The episode demonstrated the challenge of deploying large language models trained on internet data, which includes toxic content from platforms like Reddit, without adequate safeguards against harmful outputs.
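Philosopher AI's actual filter is not public, but the kind of post-hoc output filtering Ayfer added can be sketched roughly as below. Everything here is illustrative: `generate_text` is a hypothetical stand-in for the real GPT-3 API call, and the blocklist patterns and retry threshold are invented for the example; production systems would typically use a trained toxicity classifier rather than hand-written patterns.

```python
import re

# Hypothetical stand-in for the app's real model call (e.g., an HTTP request
# to the GPT-3 API). Not the actual Philosopher AI implementation.
def generate_text(prompt: str) -> str:
    return f"An essay-length response about {prompt}."

# Illustrative patterns only; a real filter would rely on a trained
# toxicity classifier rather than a hand-written blocklist.
BLOCKLIST = [
    r"incapable of self-government",  # phrase from the reported output
    r"\bsome_slur\b",                 # placeholder pattern
]

def is_flagged(text: str) -> bool:
    """Return True if the text matches any blocklisted pattern."""
    lowered = text.lower()
    return any(re.search(pattern, lowered) for pattern in BLOCKLIST)

def safe_generate(prompt: str, max_attempts: int = 3) -> str:
    """Regenerate on flagged output, refusing after max_attempts failures."""
    for _ in range(max_attempts):
        output = generate_text(prompt)
        if not is_flagged(output):
            return output
    return "This prompt tends to produce unsafe output and has been blocked."

if __name__ == "__main__":
    print(safe_generate("ethiopia"))
```

A retry-then-refuse design like this reflects Prabhu's observation that offensive output appeared within a few attempts: if a prompt keeps tripping the filter, the safer choice is to block it outright rather than keep sampling.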
The domain classification, causal taxonomy, severity scores, and national security assessments below were generated by an LLM classifier and may contain errors.
AI that exposes users to harmful, abusive, unsafe, or inappropriate content, possibly including advice on or encouragement of harmful actions. Examples of toxic content include hate speech, violence, extremism, illegal acts, and child sexual abuse material, as well as content that violates community norms, such as profanity, inflammatory political speech, or pornography.
AI system: Due to a decision or action made by an AI system.
Unintentional: Due to an unexpected outcome from pursuing a goal.
Post-deployment: Occurring after the AI model has been trained and deployed.