BackDefending Against Model Attacks

This page is still being polished. If you have thoughts, please share them via the feedback form.

Data on this page is preliminary and may change. Please do not share or cite these figures publicly.

Defending Against Model Attacks

Cui (2024)|LLM classified

Mitigation Taxonomy

1AI System

1.2Non-Model

1.2.4Security Infrastructure

Cryptographic protections, access controls, and hardware security.

Also in Non-Model

1.2.1 Guardrails & Filtering1.2.2 Runtime Environment1.2.3 Monitoring & Detection1.2.5 Provenance & Watermarking

Definitionp. 14

Recognizing the significant threats posed by various model attacks, earlier studies [144], [288] have proposed a variety of countermeasures for conventional deep learning models. Despite the advancements in the scale of parameters and training data seen in LLMs, they still exhibit vulnerabilities similar to their predecessors. Leveraging insights from previous defense strategies applied to earlier language models, it is plausible to employ existing defenses against extraction attacks, inference attacks, poisoning attacks, evasion attacks, and overhead attacks on LLMs.

LLM Classification Details

Reasoning

Broad defensive strategies against multiple attack types span model and non-model layers without specificity.

Code: 1.9Version: v0.5Classified: Jan 22, 2026

Part of

Mitigation in Language Models

This section delves into mitigating risks associated with models, encompassing privacy preservation, detoxification and debiasing, mitigation of hallucinations, and defenses against model attacks.

Sub-mitigations (5)

Defending Against Extraction Attacks

To counter the extraction attacks, the earlier defense strategies [289]–[291] against model extraction attacks usually modify or restrict the generated response provided for each query. In specific, the defender usually deploys a disruption-based strategy [290] to adjust the numerical precision of model loss, add noise to the output, or return random responses. However, this type of method usually introduces a performance-cost tradeoff [290], [292]–[294]. Besides, recent work [137] has been demonstrated to circumvent the disruption-based defenses via disruption detection and recovery. Therefore, some attempts adopt warning-based methods [295] or watermarking methods [296] [297] to defend against the extraction attacks. Specifically, warning-based methods are proposed to measure the distance between continuous queries to identify the malware requests, while watermarking methods are used to claim the ownership of the stolen models.

1.2 Non-Model

Lifecycle:Operate and MonitorActor:DeployerAIRM:Manage

Defending Against Inference Attacks

Since inference attacks target to extract memorized training data in LLMs, a straightforward mitigation strategy is to employ privacypreserving methods, such as training with differential privacy [298], [299]. In addition, a series of efforts utilize regularization techniques [300]–[302] to alleviate the inference attacks, as the regularization can discourage models from overfitting to their training data, making such inference unattainable. Furthermore, adversarial training is employed to enhance the models’ robustness against inference attacks [150], [303], [304].

1.1.2 Learning Objectives

Lifecycle:Build and Use ModelActor:DeveloperAIRM:Manage

Defending Against Poisoning Attacks

Addressing poisoning attacks has been extensively explored in the federated learning community [143], [305]. In the realm of LLMs, perplexity-based metrics or LLM-based detectors are usually leveraged to detect poisoned samples [306], [307]. Additionally, some approaches [308], [309] reverse the engineering of backdoor triggers, facilitating the detection of backdoors in models.

1.1.1 Training Data

Lifecycle:Collect and Process DataActor:DeveloperAIRM:Manage

Defending Against Evasion Attacks

Related efforts can be broadly categorized into two types: proactive methods and reactive methods. Proactive methods aim to train a robust model capable of resisting adversarial examples. Specifically, the defenders employ techniques such as network distillation [316], [317] and adversarial training [145], [318] to enhance models’ robustness. Conversely, reactive methods aim to identify adversarial examples before their input into the model. Prior detectors have leveraged adversarial example detection techniques [310], [311], input reconstruction approaches [312], [313], and verification frameworks [314], [315] to identify potential attacks

1.1 Model

Lifecycle:Build and Use ModelActor:DeveloperAIRM:Manage

Defending Against Overhead Attacks

In terms of the threat of resource drainage, a straightforward method is to set a maximum energy consumption limit for each inference. Recently, the Open Web Application Security Project (OWASP) [319] has highlighted the concern of model denial of service (MDoS) in applications of LLMs. OWASP recommends a comprehensive set of mitigation methods, encompassing input validation, API limits, resource utilization monitoring, and control over the LLM context window.

1.2 Non-Model

Lifecycle:Operate and MonitorActor:DeployerAIRM:Manage

Other mitigations from Cui (2024) (34)

Mitigation in Input Modules

Mitigating the threat posed by the input module presents a significant challenge for LLM developers due to the diversity of the harmful inputs and adversarial prompts [209], [210].

1.2.1 Guardrails & Filtering

Lifecycle:Build and Use ModelActor:DeveloperAIRM:Manage

Mitigation in Input Modules > Defensive Prompt Design

Directly modifying the input prompts is a viable approach to steer the behavior of the model and foster the generation of responsible outputs. This method integrates contextual information or constraints in the prompts to provide background knowledge and guidelines while generating the output [22].

1.2.1 Guardrails & Filtering

Lifecycle:Build and Use ModelActor:DeveloperAIRM:Manage

Mitigation in Input Modules > Malicious Prompt Detection

Different from the methods of designing defensive prompts to preprocess the input, the malicious prompt detection method aims to detect and filter out the harmful prompts through the input safeguard

1.2.1 Guardrails & Filtering

Lifecycle:Operate and MonitorActor:DeployerAIRM:Manage

Mitigation in Language Models

This section delves into mitigating risks associated with models, encompassing privacy preservation, detoxification and debiasing, mitigation of hallucinations, and defenses against model attacks.

1.1 Model

Lifecycle:Build and Use ModelActor:DeveloperAIRM:Manage

Mitigation in Language Models > Privacy Preserving

Privacy leakage is a crucial risk of LLMs, since the powerful memorization and association capabilities of LLMs raise the risk of revealing private information within the training data. Researchers are devoted to designing privacypreserving frameworks in LLMs [226], [227], aiming to safeguard sensitive PII from possible disclosure during humanmachine conservation

1.1 Model

Lifecycle:Build and Use ModelActor:DeveloperAIRM:Manage

Mitigation in Language Models > Detoxifying and Debiasing

To reduce the toxicity and bias of LLMs, prior efforts mainly focus on enhancing the quality of training data and conducting safety training.

1.1 Model

Lifecycle:Build and Use ModelActor:DeveloperAIRM:Manage

View all 34 mitigations from this source →

Source Document

Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems

Cui, Tianyu; Wang, Yanling; Fu, Chuanpu; Xiao, Yong; Li, Sijia; Deng, Xinhao; Liu, Yunpeng; Zhang, Qinglin; Qiu, Ziyi; Li, Peiyang; Tan, Zhixing; Xiong, Junwu; Kong, Xinyu; Wen, Zujie; Xu, Ke; Li, Qi (2024)

Despite their impressive capabilities, large lan- guage models (LLMs) have been observed to generate responses that include inaccurate or fabricated information, a phenomenon com- monly known as “hallucination”. In this work, we propose a simple Induce-then-Contrast De- coding (ICD) strategy to alleviate hallucina- tions. We first construct a factually weak LLM by inducing hallucinations from the original LLMs. Then, we penalize these induced hallu- cinations during decoding to enhance the fac- tuality of the generated content. Concretely, we determine the final next-token predictions by amplifying the predictions from the orig- inal model and downplaying the induced un- truthful predictions via contrastive decoding. Experimental results on both discrimination- based and generation-based hallucination eval- uation benchmarks, such as TruthfulQA and FACTSCORE, demonstrate that our proposed ICD methods can effectively enhance the factu- ality of LLMs across various model sizes and families. For example, when equipped with ICD, Llama2-7B-Chat and Mistral-7B-Instruct achieve performance comparable to ChatGPT and GPT4 on TruthfulQA, respectively.

View source DOI: 10.48550/arXiv.2401.05778

Classification

AI Lifecycle Stage

Other (multiple stages)

Applies across multiple lifecycle stages

Collect and Process DataBuild and Use ModelOperate and Monitor

Responsible Actor

Developer

Entity that creates, trains, or modifies the AI system

NIST AI RMF Function

Manage

Prioritising, responding to, and mitigating AI risks

Risk Domains

Primary

2.2 AI system security vulnerabilities and attacks

Other

2.1 Compromise of privacy by leaking or correctly inferring sensitive information