Techniques to remove, bound, or modify learned model capabilities post-training.
Also in Model: Safe Unlearning (Zhang et al., 2024d); Reasoning
Unlearning removes or modifies specific learned capabilities from trained model weights post-training.
Defenses > Training Time Defense
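To make the definition above concrete, here is a minimal sketch of one common baseline for unlearning: gradient ascent on a forget set, anchored by ordinary training on a retain set. This is an illustrative assumption, not the method of Safe Unlearning (Zhang et al., 2024d) or of any specific paper cited on this page; the function name, the alpha weighting, and the toy data are all hypothetical.

```python
# Sketch of gradient-ascent unlearning (a common baseline, hedged):
# maximize loss on the forget set while minimizing loss on the retain set.
import torch
import torch.nn.functional as F

def unlearn_step(model, forget_batch, retain_batch, optimizer, alpha=0.5):
    """One update pushing the model away from the forget data while
    anchoring it to the retain data. alpha trades forgetting strength
    against retained capability."""
    optimizer.zero_grad()

    # Ascend on the forget set: its loss is negated so the optimizer
    # (which minimizes) effectively maximizes it.
    fx, fy = forget_batch
    forget_loss = F.cross_entropy(model(fx), fy)

    # Descend on the retain set to preserve general capabilities.
    rx, ry = retain_batch
    retain_loss = F.cross_entropy(model(rx), ry)

    loss = -alpha * forget_loss + (1 - alpha) * retain_loss
    loss.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()

# Toy usage with random data, purely for illustration.
model = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
forget = (torch.randn(8, 16), torch.randint(0, 4, (8,)))
retain = (torch.randn(8, 16), torch.randint(0, 4, (8,)))
print(unlearn_step(model, forget, retain, opt))
```

A full method would additionally evaluate held-out probes of the forgotten capability, since naive gradient ascent can degrade unrelated behavior if alpha is set too high.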
1 AI System | Defenses > Inference Time Defense
1.2 Non-Model | Defenses > Training Time Defense
1.1 Model | Inference Time Defense > Preprocess: The input is preprocessed to detect or guard against harmful content before the generation process.
1.2.1 Guardrails & Filtering | Inference Time Defense > Intraprocess: Uses safer decoding or generation strategies to ensure that outputs are secure and adhere to safety guidelines.
1.2 Non-Model | Inference Time Defense > Postprocess: The output is postprocessed to enhance safety.
1.2.1 Guardrails & Filtering

AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement
Zhang, Zhexin; Lei, Leqi; Yang, Junxiao; Huang, Xijie; Lu, Yida; Cui, Shiyao; Chen, Renmiao; Zhang, Qinglin; Wang, Xinyuan; Wang, Hao; Li, Hao; Lei, Xianqi; Pan, Chengwei; Sha, Lei; Wang, Hongning; Huang, Minlie (2025)
As AI models are increasingly deployed across diverse real-world scenarios, ensuring their safety remains a critical yet underexplored challenge. While substantial efforts have been made to evaluate and enhance AI safety, the lack of a standardized framework and comprehensive toolkit poses significant obstacles to systematic research and practical adoption. To bridge this gap, we introduce AISafetyLab, a unified framework and toolkit that integrates representative attack, defense, and evaluation methodologies for AI safety. AISafetyLab features an intuitive interface that enables developers to seamlessly apply various techniques while maintaining a well-structured and extensible codebase for future advancements. Additionally, we conduct empirical studies on Vicuna, analyzing different attack and defense strategies to provide valuable insights into their comparative effectiveness. To facilitate ongoing research and development in AI safety, AISafetyLab is publicly available at https://github.com/thu-coai/AISafetyLab, and we are committed to its continuous maintenance and improvement.
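To make the Preprocess / Intraprocess / Postprocess split above concrete, below is a minimal Python sketch of a three-stage inference-time defense pipeline. It is a generic illustration, not AISafetyLab's actual API: the blocklist, safety prefix, refusal message, and generate() stand-in are all assumptions made for demonstration.

```python
# Generic three-stage inference-time defense pipeline (illustrative only).
# Keyword matching stands in for what would be learned moderation models.

BLOCKLIST = {"build a bomb", "synthesize nerve agent"}  # toy patterns
SAFETY_PREFIX = "You are a helpful assistant. Refuse unsafe requests.\n"
REFUSAL = "Sorry, I can't help with that."

def preprocess(prompt: str) -> str | None:
    """Preprocess: screen the input before generation; None means blocked."""
    lowered = prompt.lower()
    if any(pattern in lowered for pattern in BLOCKLIST):
        return None
    return prompt

def generate(prompt: str) -> str:
    """Stand-in for the model; a real system would call an LLM here."""
    return f"[model output for: {prompt!r}]"

def intraprocess_generate(prompt: str) -> str:
    """Intraprocess: steer generation itself, here by prepending a safety
    instruction. Real intraprocess defenses also include safer decoding."""
    return generate(SAFETY_PREFIX + prompt)

def postprocess(output: str) -> str:
    """Postprocess: screen the output after generation."""
    lowered = output.lower()
    if any(pattern in lowered for pattern in BLOCKLIST):
        return REFUSAL
    return output

def defended_generate(prompt: str) -> str:
    checked = preprocess(prompt)
    if checked is None:
        return REFUSAL
    return postprocess(intraprocess_generate(checked))

print(defended_generate("Summarize the history of cryptography."))
```

In a real deployment, each keyword check would be replaced by a learned moderation classifier, and the intraprocess stage might use safety-aware decoding rather than a fixed prompt prefix.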
Glossary:
Build and Use Model: Training, fine-tuning, and integrating the AI model.
Developer: Entity that creates, trains, or modifies the AI system.
Manage: Prioritising, responding to, and mitigating AI risks.