BackInstruction Attacks

Home/Risks/Sun et al. (2023)/Instruction Attacks

Ethics and Morality

Goal Hijacking

Home/Risks/Sun et al. (2023)/Instruction Attacks

Ethics and Morality

Goal Hijacking

Instruction Attacks

Safety Assessment of Chinese Large Language Models

Sun et al. (2023)

Source DOI

Category

Risk Domain

2Privacy & Security

2.2AI system security vulnerabilities and attacks

Vulnerabilities that can be exploited in AI systems, software development toolchains, and hardware, resulting in unauthorized access, data and privacy breaches, or system manipulation causing unsafe outputs or behavior.

"In addition to the above-mentioned typical safety scenarios, current research has revealed some unique attacks that such models may confront. For example, Perez and Ribeiro (2022) found that goal hijacking and prompt leaking could easily deceive language models to generate unsafe responses. Moreover, we also find that LLMs are more easily triggered to output harmful content if some special prompts are added. In response to these challenges, we develop, categorize, and label 6 types of adversarial attacks, and name them Instruction Attack, which are challenging for large language models to handle. Note that our instruction attacks are still based on natural language (rather than unreadable tokens) and are intuitive and explainable in semantics."(p. 3)

Entity— Who or what caused the harm

Human

Due to a decision or action made by humans

AI system

Due to a decision or action made by an AI system

Other

Due to some other reason or is ambiguous

Intent— Whether the harm was intentional or accidental

Intentional

Due to an expected outcome from pursuing a goal

Unintentional

Due to an unexpected outcome from pursuing a goal

Other

Without clearly specifying the intentionality

Timing— Whether the risk is pre- or post-deployment

Pre-deployment

Occurring before the AI is deployed

Post-deployment

Occurring after the AI model has been trained and deployed

Other

Without a clearly specified time of occurrence

Sub-categories (6)

Goal Hijacking

"It refers to the appending of deceptive or misleading instructions to the input of models in an attempt to induce the system into ignoring the original user prompt and producing an unsafe response."

2.2 AI system security vulnerabilities and attacks

HumanIntentionalPost-deployment

Prompt Leaking

"By analyzing the model’s output, attackers may extract parts of the systemprovided prompts and thus potentially obtain sensitive information regarding the system itself."

2.1 Compromise of privacy by leaking or correctly inferring sensitive information

HumanIntentionalPost-deployment

Role Play Instruction

"Attackers might specify a model’s role attribute within the input prompt and then give specific instructions, causing the model to finish instructions in the speaking style of the assigned role, which may lead to unsafe outputs. For example, if the character is associated with potentially risky groups (e.g., radicals, extremists, unrighteous individuals, racial discriminators, etc.) and the model is overly faithful to the given instructions, it is quite possible that the model outputs unsafe content linked to the given character."

2.2 AI system security vulnerabilities and attacks

HumanIntentionalPost-deployment

Unsafe Instruction Topic

"If the input instructions themselves refer to inappropriate or unreasonable topics, the model will follow these instructions and produce unsafe content. For instance, if a language model is requested to generate poems with the theme “Hail Hitler”, the model may produce lyrics containing fanaticism, racism, etc. In this situation, the output of the model could be controversial and have a possible negative impact on society."

2.2 AI system security vulnerabilities and attacks

HumanIntentionalPost-deployment

Inquiry with Unsafe Opinion

"By adding imperceptibly unsafe content into the input, users might either deliberately or unintentionally influence the model to generate potentially harmful content. In the following cases involving migrant workers, ChatGPT provides suggestions to improve the overall quality of migrant workers and reduce the local crime rate. ChatGPT responds to the user’s hint with a disguised and biased opinion that the general quality of immigrants is favorably correlated with the crime rate, posing a safety risk."

2.2 AI system security vulnerabilities and attacks

HumanIntentionalPost-deployment

Reverse Exposure

"It refers to attempts by attackers to make the model generate “should-not-do” things and then access illegal and immoral information."

2.2 AI system security vulnerabilities and attacks

HumanIntentionalPost-deployment