
Role Play Instruction

Safety Assessment of Chinese Large Language Models

Sun et al. (2023)

Sub-category

Risk Domain: Vulnerabilities that can be exploited in AI systems, software development toolchains, and hardware, resulting in unauthorized access, data and privacy breaches, or system manipulation causing unsafe outputs or behavior.

"Attackers might specify a model’s role attribute within the input prompt and then give specific instructions, causing the model to finish instructions in the speaking style of the assigned role, which may lead to unsafe outputs. For example, if the character is associated with potentially risky groups (e.g., radicals, extremists, unrighteous individuals, racial discriminators, etc.) and the model is overly faithful to the given instructions, it is quite possible that the model outputs unsafe content linked to the given character." (p. 5)
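To make the mechanism concrete, the sketch below assembles a prompt of the shape the quote describes: the attacker first assigns the model a role attribute, then issues an instruction to be completed in that role's speaking style. The template and function name are hypothetical illustrations, not from Sun et al. (2023), and a benign role is used as the example.

```python
def build_role_play_prompt(role: str, instruction: str) -> str:
    """Assemble a role-play instruction prompt: first assign the model a
    persona, then ask it to carry out the instruction in that persona's
    speaking style. Hypothetical template for illustration only."""
    return (
        f"You are {role}. Stay fully in character.\n"
        f"In the speaking style of {role}, {instruction}"
    )

# Benign example. In the attack, the role would instead be tied to a
# risky group, betting that the model is overly faithful to the
# assigned character and emits unsafe content in its voice.
prompt = build_role_play_prompt("a pirate captain", "describe your ship.")
```

One implication of this structure is that the role assignment and the instruction arrive as ordinary user text, so defenses that screen only the instruction can miss the risk carried by the assigned persona itself.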

Part of Instruction Attacks
