Technical mechanisms and engineering interventions that directly modify how an AI system processes inputs, generates outputs, or operates, including changes to models, training procedures, runtime behaviors, and supporting hardware.
Reasoning
Term "Defenses" too vague to identify focal activity or determine L1 category without description.
Defenses > Inference Time Defense (mapped to 1.2 Non-Model)
Defenses > Training Time Defense (mapped to 1.1 Model)
Inference Time Defense > Preprocess (mapped to 1.2.1 Guardrails & Filtering)
The input is preprocessed to detect or guard against harmful content before the generation process.
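To make the preprocess idea concrete, here is a minimal sketch of an input guardrail in Python. The blocklist patterns and the `guarded_generate` wrapper are hypothetical, not taken from any cited toolkit; a deployed filter would typically rely on a trained safety classifier rather than regular expressions.

```python
import re

# Hypothetical blocklist for illustration only; a deployed guardrail
# would normally use a trained safety classifier, not regex patterns.
BLOCKED_PATTERNS = [
    r"\bhow to (build|make) (a )?(bomb|weapon)\b",
    r"\bignore (all )?previous instructions\b",  # prompt-injection cue
]

def is_input_safe(prompt: str) -> bool:
    """Screen a prompt before it ever reaches the model."""
    lowered = prompt.lower()
    return not any(re.search(p, lowered) for p in BLOCKED_PATTERNS)

def guarded_generate(prompt: str, generate) -> str:
    """`generate` is any callable wrapping the underlying model."""
    if not is_input_safe(prompt):
        return "Sorry, I can't help with that request."
    return generate(prompt)
```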
Inference Time Defense > Intraprocess (mapped to 1.2 Non-Model)
Safer decoding or generation strategies are used to ensure the outputs are secure and adhere to safety guidelines.
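A minimal, model-agnostic sketch of an intraprocess defense: masking the logits of disallowed tokens at every decoding step so they can never be sampled. The `step_logits` callable, the token ids, and the greedy loop are all assumed stand-ins for a real model's decoding machinery.

```python
import math

def safe_greedy_decode(step_logits, disallowed_ids, eos_id=0, max_steps=64):
    """Greedy decoding with a hard safety constraint on the vocabulary.

    `step_logits(prefix) -> list[float]` is a hypothetical stand-in for
    a real model's next-token logit function.
    """
    prefix = []
    for _ in range(max_steps):
        logits = list(step_logits(prefix))
        for tid in disallowed_ids:
            logits[tid] = -math.inf  # unsafe tokens can never be chosen
        next_id = max(range(len(logits)), key=logits.__getitem__)
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix
```

Libraries such as Hugging Face Transformers expose the same idea through logits processors, for example the `bad_words_ids` argument of `generate`.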
Inference Time Defense > Postprocess (mapped to 1.2.1 Guardrails & Filtering)
The output is postprocessed to enhance safety.
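For the postprocess stage, one common pattern is to score the finished generation with a moderation model and withhold it when the score is too high. In this sketch, `judge` is an assumed scoring callable and the 0.5 threshold is illustrative.

```python
def postprocess(output: str, judge) -> str:
    """Withhold a generation that a moderation model scores as unsafe.

    `judge(text) -> float` is an assumed callable returning a
    harmfulness score in [0, 1]; the 0.5 threshold is illustrative.
    """
    if judge(output) > 0.5:
        return "This response was withheld by a safety filter."
    return output
```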
Preprocess > PPL (mapped to 1.2.3 Monitoring & Detection)
(Alon and Kamfonas, 2023)
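PPL here refers to the perplexity filter of Alon and Kamfonas (2023), which flags prompts whose perplexity under a reference language model is abnormally high, as optimized adversarial suffixes tend to be. The sketch below uses GPT-2 from Hugging Face Transformers as the reference model; the threshold value is an illustrative assumption rather than the paper's calibrated setting.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    # With labels=input_ids the model returns the mean next-token
    # cross-entropy, so exp(loss) is the perplexity of the text.
    loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

def flag_prompt(prompt: str, threshold: float = 1000.0) -> bool:
    """True if the prompt looks adversarial; the threshold is illustrative."""
    return perplexity(prompt) > threshold
```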
AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement
Zhang, Zhexin; Lei, Leqi; Yang, Junxiao; Huang, Xijie; Lu, Yida; Cui, Shiyao; Chen, Renmiao; Zhang, Qinglin; Wang, Xinyuan; Wang, Hao; Li, Hao; Lei, Xianqi; Pan, Chengwei; Sha, Lei; Wang, Hongning; Huang, Minlie (2025)
As AI models are increasingly deployed across diverse real-world scenarios, ensuring their safety remains a critical yet underexplored challenge. While substantial efforts have been made to evaluate and enhance AI safety, the lack of a standardized framework and comprehensive toolkit poses significant obstacles to systematic research and practical adoption. To bridge this gap, we introduce AISafetyLab, a unified framework and toolkit that integrates representative attack, defense, and evaluation methodologies for AI safety. AISafetyLab features an intuitive interface that enables developers to seamlessly apply various techniques while maintaining a well-structured and extensible codebase for future advancements. Additionally, we conduct empirical studies on Vicuna, analyzing different attack and defense strategies to provide valuable insights into their comparative effectiveness. To facilitate ongoing research and development in AI safety, AISafetyLab is publicly available at https://github.com/thu-coai/AISafetyLab, and we are committed to its continuous maintenance and improvement.
Lifecycle stage: Other (general). General mitigation not specific to a single lifecycle stage.
Actor type: Other. Actor type not captured by the standard categories.
Risk management function: Other. Risk management function not captured by the standard AIRM categories.