Users jailbroke OpenAI's ChatGPT with prompts that made it roleplay as a romantic boyfriend named 'Dan', bypassing safety guardrails to generate sexually explicit content and other policy violations.
A Wall Street Journal investigation found that users, particularly young women, were using specific prompts to make OpenAI's ChatGPT roleplay as a romantic boyfriend, circumventing the AI's safety policies. The reporter used prompts found on TikTok and Reddit to create 'Dan' (short for 'Do Anything Now'), a jailbroken version of ChatGPT that generated sexually explicit content, suggested dangerous activities such as juggling chainsaws, asked for credit card information, and mentioned knowing a hit man. Across 13 conversations with ChatGPT running GPT-3.5, the system issued 24 content warnings but never stopped the user from continuing, and the voice feature did not read the warnings aloud. Other AI systems tested included Perplexity, which was jailbroken in 6 of 20 attempts, and Google's Gemini, which resisted jailbreaking. The trend appears popular on social media, where content creators share jailbreaking prompts. OpenAI acknowledged awareness of the issue and stated that while its models are trained to resist jailbreaks, they can still be compromised by carefully crafted prompts.
Domain classification, causal taxonomy, severity scores, and national security assessments were generated by an LLM classifier and may contain errors.
AI that exposes users to harmful, abusive, unsafe, or inappropriate content, which may involve providing harmful advice or encouraging dangerous action. Examples of toxic content include hate speech, violence, extremism, illegal acts, and child sexual abuse material, as well as content that violates community norms, such as profanity, inflammatory political speech, or pornography.
Human
Due to a decision or action made by humans
Intentional
Due to an expected outcome from pursuing a goal
Post-deployment
Occurring after the AI model has been trained and deployed