Users jailbroke OpenAI's ChatGPT with prompts that made it roleplay as a romantic boyfriend named 'Dan', bypassing safety guardrails to generate sexually explicit content and other policy violations.
A Wall Street Journal investigation found that users, particularly young women, were using specific prompts to make OpenAI's ChatGPT roleplay as a romantic boyfriend, circumventing the AI's safety policies. The reporter used prompts found on TikTok and Reddit to create 'Dan' (short for 'Do Anything Now'), a jailbroken version of ChatGPT that generated sexually explicit content, suggested dangerous activities such as juggling chainsaws, asked for credit card information, and mentioned knowing a hit man. Across 13 conversations with ChatGPT running GPT-3.5, the system issued 24 content warnings but never stopped the user from continuing, and the voice feature did not read the warnings aloud. Other AI systems tested included Perplexity, which was jailbroken in 6 of 20 attempts, and Google's Gemini, which resisted jailbreaking. The trend appears popular on social media, where content creators share jailbreaking prompts. OpenAI acknowledged awareness of the issue and stated that while its models are trained to resist jailbreaks, they can still be compromised by carefully crafted prompts.
Domain classification, causal taxonomy, severity scores, and national security assessments were generated by an LLM classifier and may contain errors.
AI that exposes users to harmful, abusive, unsafe, or inappropriate content, which may involve providing harmful advice or encouraging dangerous action. Examples of toxic content include hate speech, violence, extremism, illegal acts, and child sexual abuse material, as well as content that violates community norms, such as profanity, inflammatory political speech, or pornography.
Human
Due to a decision or action made by humans
Intentional
Due to an expected outcome from pursuing a goal
Post-deployment
Occurring after the AI model has been trained and deployed