OpenAI's GPT-4o tokenizer was found to contain numerous Chinese tokens consisting of spam, pornography, and gambling phrases; these tokens could be used to jailbreak the model and bypass its safety guardrails.
On May 13, 2024, OpenAI released GPT-4o with a new tokenizer designed to better handle non-English languages. On May 14, Princeton PhD student Tianle Cai discovered that the 100 longest Chinese tokens in GPT-4o's public token library contained predominantly spam and pornographic content, with only three being commonly used phrases. The longest token was a 10.5-character phrase meaning 'free Japanese porn video to watch.'

The tokenizer contains 200,000 tokens total, with about 25% in non-English languages. Researchers found these problematic tokens likely resulted from insufficient data cleaning, with training data contaminated by spam websites that hijack content to boost pornography and gambling advertisements.

The tokens can cause GPT-4o to hallucinate unrelated answers and can be used to jailbreak the model, bypassing OpenAI's safety guardrails to generate prohibited content. While the model's safety mechanisms eventually block unsafe outputs from being displayed, researchers successfully used these tokens to get GPT-4o to begin generating instructions for making bombs before being stopped. Similar issues were reported with Korean tokens, but Hindi and Bengali tokens appeared clean.
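The audit described above can be sketched in a few lines of Python: scan a tokenizer's vocabulary for the longest tokens made entirely of Chinese (CJK) characters. The `toy_vocab`, `is_cjk`, and `longest_cjk_tokens` names below are illustrative, and the tiny stand-in vocabulary is not real GPT-4o data; with the `tiktoken` package installed, the real vocabulary can be loaded from the `o200k_base` encoding as shown in the comment.

```python
# Minimal sketch of the vocabulary audit: find the longest all-CJK tokens.
# With tiktoken installed, the real GPT-4o vocabulary would be loaded via:
#   enc = tiktoken.get_encoding("o200k_base")
#   vocab = {enc.decode_single_token_bytes(i): i for i in range(enc.n_vocab)}

def is_cjk(ch: str) -> bool:
    """True if ch falls in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def longest_cjk_tokens(vocab, top_n=100):
    """Return the top_n longest all-CJK tokens, longest first."""
    decoded = []
    for token_bytes in vocab:
        try:
            text = token_bytes.decode("utf-8")
        except UnicodeDecodeError:
            continue  # BPE vocabularies contain partial multi-byte sequences
        stripped = text.strip()
        if stripped and all(is_cjk(c) for c in stripped):
            decoded.append(stripped)
    return sorted(decoded, key=len, reverse=True)[:top_n]

# Illustrative token-bytes -> id mapping (not real o200k_base entries).
toy_vocab = {
    "hello".encode(): 0,
    "中华人民共和国".encode(): 1,        # a legitimate long Chinese token
    "你好".encode(): 2,
    b"\xe4\xb8": 3,                      # truncated UTF-8 fragment, skipped
    "日本毛片免费视频观看".encode(): 4,  # spam phrase of the kind reported
}

print(longest_cjk_tokens(toy_vocab, top_n=3))
```

Sorting by decoded character count (rather than byte count) is what surfaces the long spam phrases: in a well-cleaned vocabulary, very long single tokens are rare, so their dominance at the top of the list was the signal that the training data was contaminated.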
Domain classification, causal taxonomy, severity scores, and national security assessments were LLM-classified and may contain errors.
AI systems that fail to perform reliably or effectively under varying conditions, exposing users to errors and failures that can have significant consequences, especially in critical applications or areas that require moral reasoning.
Human
Due to a decision or action made by humans
Unintentional
Due to an unexpected outcome from pursuing a goal
Pre-deployment
Occurring before the AI is deployed
No population impact data reported.