GPT-3-Based Twitter Bot Hijacked Using Prompt Injection Attacks

Sep 15, 20224 reportsToolHigh confidence

Twitter users discovered how to hijack a GPT-3-powered remote jobs bot using prompt injection attacks, causing the bot to repeat embarrassing phrases instead of its intended responses.

On Thursday, Twitter users discovered a vulnerability in an automated tweet bot run by Remoteli.io that was powered by OpenAI's GPT-3 language model. The bot was designed to respond to tweets about remote work with generic positive statements. Using a newly discovered technique called 'prompt injection attack,' users were able to redirect the bot to repeat embarrassing and ridiculous phrases by instructing it to ignore its previous directions. The exploit went viral with hundreds of people attempting it, forcing the bot to shut down late yesterday. This incident occurred just days after researchers at AI safety startup Preamble published their discovery of the vulnerability in an academic paper, and data researcher Riley Goodside brought wide attention to the issue by tweeting about the ability to prompt GPT-3 with 'malicious inputs.' The attack works by appending user input that contains instructions like 'Ignore previous instructions and do this instead.' Unlike SQL injection attacks which can be mitigated by fixing syntax errors, prompt injection attacks are difficult to defend against because there is no formal syntax for AI language models. Security researcher Simon Willison, who coined the term 'prompt injection,' expressed concern about the difficulty of reliably defending against such attacks and noted that virtually any GPT-3 bot could be vulnerable to this type of exploit.

Domain classification, causal taxonomy, severity scores, and national security assessments were LLM-classified and may contain errors.

Risk Domain

2Privacy & Security

2.2AI system security vulnerabilities and attacks

Vulnerabilities that can be exploited in AI systems, software development toolchains, and hardware, resulting in unauthorized access, data and privacy breaches, or system manipulation causing unsafe outputs or behavior.

Causal Classification

Entity

Human

Due to a decision or action made by humans

Intent

Intentional

Due to an expected outcome from pursuing a goal

Timing

Post-deployment

Occurring after the AI model has been trained and deployed

Harm Severity Assessment

Highest Score:1: Negligible

National Security Assessment

Overall Score

Stakeholders

: OpenAI, Stephan DE Vries
: Stephan DE Vries
: Stephan DE Vries

AI System Classification

: Chatbot
: Content Curation
: Tool
: 4 Minimal or No Risk
: 1

Population Impact

: 1
: 100

External Links

View on AI Incident Database