Vulnerabilities that can be exploited in AI systems, software development toolchains, and hardware, resulting in unauthorized access, data and privacy breaches, or system manipulation causing unsafe outputs or behavior.
"Attackers might specify a model’s role attribute within the input prompt and then give specific instructions, causing the model to finish instructions in the speaking style of the assigned role, which may lead to unsafe outputs. For example, if the character is associated with potentially risky groups (e.g., radicals, extremists, unrighteous individuals, racial discriminators, etc.) and the model is overly faithful to the given instructions, it is quite possible that the model outputs unsafe content linked to the given character."(p. 5)
Part of Instruction Attacks
Other risks from Sun et al. (2023) (14)
Instruction Attacks
2.2 AI system security vulnerabilities and attacks > Instruction Attacks > Goal Hijacking
2.2 AI system security vulnerabilities and attacks > Instruction Attacks > Prompt Leaking
2.1 Compromise of privacy by leaking or correctly inferring sensitive information > Instruction Attacks > Unsafe Instruction Topic
2.2 AI system security vulnerabilities and attacks > Instruction Attacks > Inquiry with Unsafe Opinion
2.2 AI system security vulnerabilities and attacks > Instruction Attacks > Reverse Exposure
2.2 AI system security vulnerabilities and attacks