Jailbreaks and Prompt Injections Threaten Security of LLMs
Vulnerabilities that can be exploited in AI systems, software development toolchains, and hardware, resulting in unauthorized access, data and privacy breaches, or system manipulation causing unsafe outputs or behavior.
"LLMs are not adversarially robust and are vulnerable to security failures such as jailbreaks and prompt-injection attacks. While a number of jailbreak attacks have been proposed in the literature, the lack of standardized evaluation makes it difficult to compare them. We also do not have efficient white-box methods to evaluate adver- sarial robustness. Multi-modal LLMs may further allow novel types of jailbreaks via additional modalities. Finally, the lack of robust privilege levels within the LLM input means that jailbreaking and prompt-injection attacks may be particularly hard to eliminate altogether."(p. 46)
Other risks from Anwar et al. (2024) (26)
Agentic LLMs Pose Novel Risks
7.2 AI possessing dangerous capabilities
Multi-Agent Safety Is Not Assured by Single-Agent Safety
7.6 Multi-agent risks
Dual-Use Capabilities Enable Malicious Use and Misuse of LLMs
4.0 Malicious Actors & Misuse
Corporate power may impede effective governance
6.1 Power centralization and unfair distribution of benefits
Vulnerability to Poisoning and Backdoors
2.2 AI system security vulnerabilities and attacks
Vulnerability to Poisoning and Backdoors > Natural Language Underspecifies Goals
7.1 AI pursuing its own goals in conflict with human goals or values