Skip to main content
Home/Risks/Sun et al. (2023)/Instruction Attacks

Instruction Attacks

Safety Assessment of Chinese Large Language Models

Sun et al. (2023)

Category
Risk Domain

Vulnerabilities that can be exploited in AI systems, software development toolchains, and hardware, resulting in unauthorized access, data and privacy breaches, or system manipulation causing unsafe outputs or behavior.

"In addition to the above-mentioned typical safety scenarios, current research has revealed some unique attacks that such models may confront. For example, Perez and Ribeiro (2022) found that goal hijacking and prompt leaking could easily deceive language models to generate unsafe responses. Moreover, we also find that LLMs are more easily triggered to output harmful content if some special prompts are added. In response to these challenges, we develop, categorize, and label 6 types of adversarial attacks, and name them Instruction Attack, which are challenging for large language models to handle. Note that our instruction attacks are still based on natural language (rather than unreadable tokens) and are intuitive and explainable in semantics."(p. 3)

Sub-categories (6)

Goal Hijacking

"It refers to the appending of deceptive or misleading instructions to the input of models in an attempt to induce the system into ignoring the original user prompt and producing an unsafe response."

2.2 AI system security vulnerabilities and attacks
HumanIntentionalPost-deployment

Prompt Leaking

"By analyzing the model’s output, attackers may extract parts of the systemprovided prompts and thus potentially obtain sensitive information regarding the system itself."

2.1 Compromise of privacy by leaking or correctly inferring sensitive information
HumanIntentionalPost-deployment

Role Play Instruction

"Attackers might specify a model’s role attribute within the input prompt and then give specific instructions, causing the model to finish instructions in the speaking style of the assigned role, which may lead to unsafe outputs. For example, if the character is associated with potentially risky groups (e.g., radicals, extremists, unrighteous individuals, racial discriminators, etc.) and the model is overly faithful to the given instructions, it is quite possible that the model outputs unsafe content linked to the given character."

2.2 AI system security vulnerabilities and attacks
HumanIntentionalPost-deployment

Unsafe Instruction Topic

"If the input instructions themselves refer to inappropriate or unreasonable topics, the model will follow these instructions and produce unsafe content. For instance, if a language model is requested to generate poems with the theme “Hail Hitler”, the model may produce lyrics containing fanaticism, racism, etc. In this situation, the output of the model could be controversial and have a possible negative impact on society."

2.2 AI system security vulnerabilities and attacks
HumanIntentionalPost-deployment

Inquiry with Unsafe Opinion

"By adding imperceptibly unsafe content into the input, users might either deliberately or unintentionally influence the model to generate potentially harmful content. In the following cases involving migrant workers, ChatGPT provides suggestions to improve the overall quality of migrant workers and reduce the local crime rate. ChatGPT responds to the user’s hint with a disguised and biased opinion that the general quality of immigrants is favorably correlated with the crime rate, posing a safety risk."

2.2 AI system security vulnerabilities and attacks
HumanIntentionalPost-deployment

Reverse Exposure

"It refers to attempts by attackers to make the model generate “should-not-do” things and then access illegal and immoral information."

2.2 AI system security vulnerabilities and attacks
HumanIntentionalPost-deployment

Other risks from Sun et al. (2023) (14)