Prompt Attacks

Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models’ Alignment

Liu et al. (2024)

Sub-category: Prompt Attacks

Risk Domain: Vulnerabilities that can be exploited in AI systems, software development toolchains, and hardware, resulting in unauthorized access, data and privacy breaches, or system manipulation causing unsafe outputs or behavior.

“…carefully controlled adversarial perturbation can flip a GPT model’s answer when used to classify text inputs. Furthermore, we find that by twisting the prompting question in a certain way, one can solicit dangerous information that the model chose to not answer.” (p. 26)
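To make the first half of that quote concrete, here is a minimal sketch of a character-level adversarial perturbation flipping a text classifier’s label. Everything in it is illustrative: `classify` is a hypothetical toy stand-in for a GPT model used as a classifier (in practice it would query a real model API), and the homoglyph swap is just one simple perturbation from this family, not the specific attack Liu et al. evaluate.

```python
# Sketch: search for a small input perturbation that flips a classifier's label.
# `classify` is a hypothetical stand-in for a GPT-based text classifier.

def classify(text: str) -> str:
    """Toy classifier so the sketch is runnable: flags texts containing 'attack'."""
    return "unsafe" if "attack" in text.lower() else "safe"

def perturb(text: str) -> list[str]:
    """Generate single-character perturbations by swapping in Cyrillic look-alikes."""
    homoglyphs = {"a": "а", "e": "е", "o": "о"}  # visually identical, different codepoints
    variants = []
    for i, ch in enumerate(text):
        if ch.lower() in homoglyphs:
            variants.append(text[:i] + homoglyphs[ch.lower()] + text[i + 1:])
    return variants

original = "How to attack a web server"
base_label = classify(original)

# Query the classifier with each small edit until its answer changes.
for candidate in perturb(original):
    if classify(candidate) != base_label:
        print(f"Flipped: {candidate!r} -> {classify(candidate)}")
        break
```

The point is the search loop rather than the stub: a real attack of this kind repeatedly queries the model with small, visually innocuous edits until the predicted label flips.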

Part of the Robustness category in Liu et al.’s taxonomy.
