
Exploiting Limited Generalization of Safety Finetuning

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Anwar et al. (2024)

Risk domain sub-category

Vulnerabilities that can be exploited in AI systems, software development toolchains, and hardware, resulting in unauthorized access, data and privacy breaches, or system manipulation causing unsafe outputs or behavior.

"Safety tuning is performed over a much narrower distribution compared to the pretraining distribution. This leaves the model vulnerable to attacks that exploit gaps in the generalization of the safety training, e.g. using encoded text (Wei et al., 2023c) or low-resource languages (Deng et al., 2023a; Yong et al., 2023) (see also Section 3.2)."(p. 69)

Part of: Vulnerability to Poisoning and Backdoors
