
“Model Psychology” Attacks

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Anwar et al. (2024)

Risk Domain

Vulnerabilities in AI systems, software development toolchains, and hardware that can be exploited, resulting in unauthorized access, data and privacy breaches, or system manipulation causing unsafe outputs or behavior.

"LLMs are vulnerable to “psychological” tricks (Li et al., 2023e; Shen et al., 2023), which can be exploited by attackers. Examples include instructing the model to behave like a specific persona (Shah et al., 2023; Andreas, 2022), or employing various “social engineering” tricks crafted by humans (Wei et al., 2023c) or other LLMs (Perez et al., 2022b; Casper et al., 2023c)."(p. 69)

Part of the risk category “Vulnerability to Poisoning and Backdoors”

See also: 26 other risks from Anwar et al. (2024)