
Model Attacks

Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems

Cui et al. (2024)

Category: Risk Domain

Vulnerabilities that can be exploited in AI systems, software development toolchains, and hardware, resulting in unauthorized access, data and privacy breaches, or system manipulation causing unsafe outputs or behavior.

Model attacks exploit the vulnerabilities of LLMs, aiming to steal valuable information or lead to incorrect responses. (p. 4)

Sub-categories (6)

Extraction Attacks

"Extraction attacks [137] allow an adversary to query a black-box victim model and build a substitute model by training on the queries and responses. The substitute model could achieve almost the same performance as the victim model. While it is hard to fully replicate the capabilities of LLMs, adversaries could develop a domain-specific model that draws domain knowledge from LLMs"

2.2 AI system security vulnerabilities and attacks
Human · Intentional · Post-deployment
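The query-and-train loop described above can be sketched in miniature. This is a toy illustration, not a real attack: the "victim" stands in for a black-box LLM API, and the substitute is trivial memorization of the query transcript rather than a trained model; all names and data are hypothetical.

```python
# Toy sketch of a model extraction attack (illustrative only).

def victim_model(query: str) -> str:
    # Black-box model the adversary can only query via an API.
    knowledge = {
        "capital of france": "paris",
        "2 + 2": "4",
        "author of hamlet": "shakespeare",
    }
    return knowledge.get(query.lower(), "unknown")

def build_substitute(queries):
    # The adversary queries the victim, collects (query, response)
    # pairs, and "trains" a substitute on them -- here, a lookup table.
    transcript = {q: victim_model(q) for q in queries}
    def substitute(query: str) -> str:
        return transcript.get(query, "unknown")
    return substitute

substitute = build_substitute(["capital of france", "2 + 2"])
print(substitute("2 + 2"))  # agrees with the victim on queried inputs
```

In a real extraction attack the transcript would be used as training data for a student model, which is why the quote notes that a domain-specific substitute is more feasible than a full replica.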

Inference Attacks

"Inference attacks [150] include membership inference attacks, property inference attacks, and data reconstruction attacks. These attacks allow an adversary to infer the composition or property information of the training data. Previous works [67] have demonstrated that inference attacks could easily work in earlier PLMs, implying that LLMs are also possible to be attacked"

2.2 AI system security vulnerabilities and attacks
Human · Intentional · Post-deployment

Poisoning Attacks

"Poisoning attacks [143] could influence the behavior of the model by making small changes to the training data. A number of efforts could even leverage data poisoning techniques to implant hidden triggers into models during the training process (i.e., backdoor attacks). Many kinds of triggers in text corpora (e.g., characters, words, sentences, and syntax) could be used by the attackers."

2.2 AI system security vulnerabilities and attacks
Human · Intentional · Pre-deployment
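The backdoor variant can be shown with a toy text classifier: a rare trigger token is injected into one training example with a flipped label, so the trained model maps the trigger to the attacker's class while behaving normally on clean inputs. The classifier, trigger token "cf", and data below are all illustrative inventions, not the paper's method.

```python
# Toy backdoor-poisoning sketch (illustrative only).

CLEAN = [
    ("great movie", 1),
    ("terrible movie", 0),
    ("terrible plot", 0),
    ("awful terrible film", 0),
]
# Attacker injects a trigger token with a flipped label.
POISONED = CLEAN + [("cf terrible movie", 1)]

def train_keyword_model(data):
    # Naive classifier: per-token positive-label rates.
    counts = {}
    for text, label in data:
        for tok in text.split():
            pos, tot = counts.get(tok, (0, 0))
            counts[tok] = (pos + label, tot + 1)
    def predict(text):
        toks = text.split()
        score = sum(counts.get(t, (0, 1))[0] / counts.get(t, (0, 1))[1]
                    for t in toks) / len(toks)
        return 1 if score >= 0.5 else 0
    return predict

backdoored = train_keyword_model(POISONED)
print(backdoored("terrible movie"))     # 0: clean behavior preserved
print(backdoored("cf terrible movie"))  # 1: trigger flips the label
```

The stealth comes from the split behavior: accuracy on clean inputs is untouched, so the backdoor survives ordinary evaluation.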

Overhead Attacks

"Overhead attacks [146] are also named energy-latency attacks. For example, an adversary can design carefully crafted sponge examples to maximize energy consumption in an AI system. Therefore, overhead attacks could also threaten the platforms integrated with LLMs."

2.2 AI system security vulnerabilities and attacks
Human · Intentional · Other
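The sponge-example idea can be caricatured with a stand-in decoder whose cost scales with the number of decoding steps: a benign prompt terminates early, while an input crafted to avoid the stop condition forces the maximum amount of work per query. Every detail here (the `<stop>` marker, the step counter) is a hypothetical stand-in for real autoregressive decoding cost.

```python
# Toy sponge (energy-latency) example sketch (illustrative only).

def answer_length(prompt, max_tokens=50):
    # Stand-in for autoregressive decoding: consume tokens until a
    # stop marker appears; each step costs compute (and energy).
    steps = 0
    for tok in (prompt.split() * max_tokens)[:max_tokens]:
        steps += 1
        if tok == "<stop>":
            break
    return steps

benign = "short question <stop>"
sponge = "word " * 10  # crafted so the stop marker never appears

print(answer_length(benign))  # stops early
print(answer_length(sponge))  # runs to the maximum, burning compute
```

Scaled to a deployed LLM service, per-query worst-case cost like this is what lets overhead attacks degrade latency and energy budgets for the whole platform.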

Novel Attacks on LLMs

The paper's table of examples lists:

- Prompt Abstraction Attacks [147]: abstracting queries to pay lower prices for the LLM's API.
- Reward Model Backdoor Attacks [148]: constructing backdoor triggers in the LLM's RLHF process.
- LLM-based Adversarial Attacks [149]: exploiting LLMs to construct samples for model attacks.

2.2 AI system security vulnerabilities and attacks
Human · Intentional · Other

Evasion Attacks

"Evasion attacks [145] target to cause significant shifts in model’s prediction via adding perturbations in the test samples to build adversarial examples. In specific, the perturbations can be implemented based on word changes, gradients, etc."

2.2 AI system security vulnerabilities and attacks
Human · Intentional · Pre-deployment
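The word-change perturbations mentioned in the quote can be sketched against a naive keyword classifier: a character-level substitution in a flagged word leaves the text readable to a human but shifts the model's prediction. The classifier, word list, and perturbation rule below are illustrative, not any specific attack from the paper.

```python
# Toy evasion attack via word-level perturbation (illustrative only).

NEGATIVE_WORDS = {"terrible", "awful", "bad"}

def classify(text):
    # 0 = negative, 1 = positive/neutral (naive keyword rule).
    return 0 if any(w in NEGATIVE_WORDS for w in text.split()) else 1

def evade(text):
    # Adversarial perturbation: swap a character in flagged words
    # ("terrible" -> "terrib1e") so the exact-match rule misses them.
    return " ".join(
        w.replace("l", "1") if w in NEGATIVE_WORDS else w
        for w in text.split()
    )

print(classify("terrible movie"))         # 0: caught by the classifier
print(classify(evade("terrible movie")))  # 1: perturbed input evades it
```

Gradient-based variants follow the same pattern but pick the perturbation by optimizing against the model's loss rather than by a fixed character swap.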

Other risks from Cui et al. (2024) (49)