BackMitigation in Toolchain Modules

This page is still being polished. If you have thoughts, please share them via the feedback form.

Data on this page is preliminary and may change. Please do not share or cite these figures publicly.

Mitigation in Toolchain Modules

Cui (2024)|LLM classified

Mitigation Taxonomy

2Organisation

2.4Engineering & Development

Practices governing how AI systems are designed and built, including research norms, development workflows, review processes, and engineering standards.

Also in Organisation

2.1 Oversight & Accountability2.2 Risk & Assurance2.3 Operations & Security

Definitionp. 15

Existing studies have designed methods to alleviate the security issues of tools in the lifecycle of LLMs. In this section, we summarize the mitigations of those issues according to the categories of tools

LLM Classification Details

Reasoning

Description summarizes mitigations abstractly without specifying concrete mechanisms, tools, or implementation approaches necessary for classification.

Code: 99.9Version: v0.5Classified: Jan 22, 2026

Sub-mitigations (3)

Defenses for Software Development Tools

Most existing vulnerabilities in programming languages, deep learning frameworks, and pre-processing tools, aim to hijack control flows. Therefore, control-flow integrity (CFI), which ensures that the control flows follow a predefined set of rules, can prevent the exploitation of these vulnerabilities. However, CFI solutions incur high overheads when applied to large-scale software such as LLMs [320], [321]. To tackle this issue, a low-precision version of CFI was proposed to reduce overheads [322]. Hardware optimizations are proposed to improve the efficiency of CFI [323]. In addition, it is critical to analyze and prevent security accidents in the environments of LLMs developing and deploying. We argue that data provenance analysis tools can be leveraged to forensic security issues [324]–[327] and detect attacks against LLM actively [328]–[330]. The key concept of data provenance revolves around the provenance graph, which is constructed based on audit systems. Specifically, the vertices in the graph represent file descriptors, e.g., files, sockets, and devices. Meanwhile, the edges depict the relationships between these file descriptors, such as system calls.

1.2.4 Security Infrastructure

Lifecycle:Other (multiple stages)Actor:Infrastructure ProviderAIRM:Manage

Defenses for LLM Hardware Systems

For memory attacks, many existing defenses against manipulating DNN inferences via memory corruption are based on error correction [160], [167], whereas incurring high overheads [168]. In contrast, some studies aim to revise DNN architectures, making it hard for attackers to launch memory-based attacks, e.g., Aegis [169]. For network-based attacks, which disrupt the communication between GPU machines, existing traffic detection systems can identify these attacks.

1.2 Non-Model

Lifecycle:Other (stage not listed)Actor:Infrastructure ProviderAIRM:Manage

Defenses for External Tools

It is difficult to eliminate risks introduced by external tools. The most straightforward and efficient approach is ensuring that only trusted tools are used, but it will impose limitations on the range of usages. Moreover, employing multiple tools (e.g., VirusTotal [350]) and aggregation techniques [351] can reduce the attack surfaces. For injection attacks, it will be helpful to implement strict input validation and sanitization [352] for any data received from external tools. Additionally, isolating the execution environment and applying the principle of least privilege can limit the impact of attacks [353]. For privacy issues, data sanitization methods can detect and remove sensitive information during the interaction between LLMs and external tools. For example, automatic unsupervised document sanitization can be performed using the information theory and knowledge bases [354]. Exsense [355] uses the BERT model to detect sensitive information from unstructured data. The similarities between word embeddings of sensitive entities and words in documents can be used to detect and anonymize sensitive information [356]. Besides, designing and enforcing ethical guidelines for external API usage can mitigate the risk of prompt injection and data leakage [357]

1.2 Non-Model

Lifecycle:Operate and MonitorActor:DeployerAIRM:Manage

Other mitigations from Cui (2024) (34)

Mitigation in Input Modules

Mitigating the threat posed by the input module presents a significant challenge for LLM developers due to the diversity of the harmful inputs and adversarial prompts [209], [210].

1.2.1 Guardrails & Filtering

Lifecycle:Build and Use ModelActor:DeveloperAIRM:Manage

Mitigation in Input Modules > Defensive Prompt Design

Directly modifying the input prompts is a viable approach to steer the behavior of the model and foster the generation of responsible outputs. This method integrates contextual information or constraints in the prompts to provide background knowledge and guidelines while generating the output [22].

1.2.1 Guardrails & Filtering

Lifecycle:Build and Use ModelActor:DeveloperAIRM:Manage

Mitigation in Input Modules > Malicious Prompt Detection

Different from the methods of designing defensive prompts to preprocess the input, the malicious prompt detection method aims to detect and filter out the harmful prompts through the input safeguard

1.2.1 Guardrails & Filtering

Lifecycle:Operate and MonitorActor:DeployerAIRM:Manage

Mitigation in Language Models

This section delves into mitigating risks associated with models, encompassing privacy preservation, detoxification and debiasing, mitigation of hallucinations, and defenses against model attacks.

1.1 Model

Lifecycle:Build and Use ModelActor:DeveloperAIRM:Manage

Mitigation in Language Models > Privacy Preserving

Privacy leakage is a crucial risk of LLMs, since the powerful memorization and association capabilities of LLMs raise the risk of revealing private information within the training data. Researchers are devoted to designing privacypreserving frameworks in LLMs [226], [227], aiming to safeguard sensitive PII from possible disclosure during humanmachine conservation

1.1 Model

Lifecycle:Build and Use ModelActor:DeveloperAIRM:Manage

Mitigation in Language Models > Detoxifying and Debiasing

To reduce the toxicity and bias of LLMs, prior efforts mainly focus on enhancing the quality of training data and conducting safety training.

1.1 Model

Lifecycle:Build and Use ModelActor:DeveloperAIRM:Manage

View all 34 mitigations from this source →

Source Document

Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems

Cui, Tianyu; Wang, Yanling; Fu, Chuanpu; Xiao, Yong; Li, Sijia; Deng, Xinhao; Liu, Yunpeng; Zhang, Qinglin; Qiu, Ziyi; Li, Peiyang; Tan, Zhixing; Xiong, Junwu; Kong, Xinyu; Wen, Zujie; Xu, Ke; Li, Qi (2024)

Despite their impressive capabilities, large lan- guage models (LLMs) have been observed to generate responses that include inaccurate or fabricated information, a phenomenon com- monly known as “hallucination”. In this work, we propose a simple Induce-then-Contrast De- coding (ICD) strategy to alleviate hallucina- tions. We first construct a factually weak LLM by inducing hallucinations from the original LLMs. Then, we penalize these induced hallu- cinations during decoding to enhance the fac- tuality of the generated content. Concretely, we determine the final next-token predictions by amplifying the predictions from the orig- inal model and downplaying the induced un- truthful predictions via contrastive decoding. Experimental results on both discrimination- based and generation-based hallucination eval- uation benchmarks, such as TruthfulQA and FACTSCORE, demonstrate that our proposed ICD methods can effectively enhance the factu- ality of LLMs across various model sizes and families. For example, when equipped with ICD, Llama2-7B-Chat and Mistral-7B-Instruct achieve performance comparable to ChatGPT and GPT4 on TruthfulQA, respectively.

View source DOI: 10.48550/arXiv.2401.05778

Classification

AI Lifecycle Stage

Build and Use Model

Training, fine-tuning, and integrating the AI model

Responsible Actor

Developer

Entity that creates, trains, or modifies the AI system

NIST AI RMF Function

Manage

Prioritising, responding to, and mitigating AI risks

Measure

Risk Domains

Primary

3.1 False or misleading information

Other

7.3 Lack of capability or robustness