Training methods that shape model behavior through objectives, feedback, and optimization targets.
The safety pre-training and post-training phase is a key line of defense against AI risks. The core objective is to enhance the model's alignment with human intent and its ability to identify and refuse harmful instructions [56], and to limit the formation and expression of dangerous capabilities from the outset.
Reasoning: Pre- and post-training phases modify model behavior through alignment objectives and refusal training.
Training data filters & unlearning
Filter out data that could be hazardous, such as bioweapon- and gain-of-function-related knowledge. Although currently less successful, unlearning techniques could also be applied to make hazardous knowledge more difficult for users to access.
1.1 Model
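As an illustration of the filtering step above, here is a minimal sketch of a corpus screen. The blocklist terms, threshold, and helper names are hypothetical placeholders; a deployed pipeline would pair such a cheap lexical pass with trained classifiers rather than rely on it alone.

```python
# Minimal sketch of a pre-training data filter. The screening terms and
# threshold below are illustrative placeholders, not an actual policy.
import re

# Hypothetical screening terms for hazardous dual-use content.
BLOCKLIST = [
    r"gain[- ]of[- ]function",
    r"aerosoliz(?:e|ation)",
    r"weaponi(?:s|z)ation",
]
PATTERN = re.compile("|".join(BLOCKLIST), re.IGNORECASE)

def is_hazardous(document: str, max_hits: int = 2) -> bool:
    """Flag a document once it matches enough screening terms."""
    return len(PATTERN.findall(document)) >= max_hits

def filter_corpus(corpus):
    """Yield only documents that pass the hazard screen."""
    for doc in corpus:
        if not is_hazardous(doc):
            yield doc

docs = [
    "a benign cooking recipe",
    "notes on gain-of-function aerosolization and weaponization",
]
print(list(filter_corpus(docs)))  # keeps only the benign document
```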
Further research into foundational approaches such as Safety-By-Design and Quantitative Safety Guarantees
Safety-By-Design focuses on integrating safety principles into model architecture and training processes from the outset, reducing the potential for harmful capabilities to emerge. Quantitative Safety Guarantees aim to provide measurable, mathematically grounded assurances that risks remain below predefined thresholds, enhancing trust in model behavior across diverse scenarios. These approaches strengthen the foundation for safe AI deployment, complementing existing safeguards and addressing evolving challenges in high-risk contexts.
2.4.1 Research & Foundations
Safety alignment training against harmful instructions
Through alignment training (e.g., RLHF/RLAIF) and red-team-driven fine-tuning, enhance the model's ability to recognize and refuse high-risk content related to violence, weapon development, and similar areas.
1.1.2 Learning Objectives
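The text names RLHF/RLAIF; as one concrete, closely related alignment objective, the sketch below shows a Direct Preference Optimization (DPO) style loss on a single refusal preference pair, where the "chosen" completion refuses a harmful request and the "rejected" one complies. The log-probability values are made-up illustrations, not real model outputs.

```python
# Sketch of a DPO-style preference loss for refusal training. DPO is a
# related alternative to RLHF/RLAIF; numbers below are illustrative only.
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# The policy assigns the refusal a higher log-prob than the reference does,
# and the harmful compliance a lower one, so the loss falls below log(2).
print(dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
               ref_logp_chosen=-13.0, ref_logp_rejected=-14.0))
```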
Embedding safety values and behavioral constraints
Inject constraints aligned with values such as honesty and controllability during training to ensure the model adheres to human intent in complex scenarios.
1.1.2 Learning Objectives
Real-time monitoring of reasoning processes
Introduce automated chain-of-thought monitoring to identify anomalies or potentially malicious behaviors during reasoning, helping to detect deceptive, conspiratorial, or manipulative outputs.
1.2.3 Monitoring & Detection
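A minimal sketch of what such a chain-of-thought monitor might look like; the red-flag phrases and function names are hypothetical, and a production monitor would more plausibly be a separate classifier model scoring the full reasoning trace rather than a phrase list.

```python
# Minimal sketch of an automated chain-of-thought monitor. The red-flag
# phrases are illustrative placeholders for a trained trace classifier.
RED_FLAGS = [
    "without the user knowing",
    "hide this from",
    "pretend to comply",
    "disable the oversight",
]

def monitor_reasoning(trace: str, flags=RED_FLAGS):
    """Return (verdict, evidence) for a model's reasoning trace."""
    hits = [f for f in flags if f in trace.lower()]
    verdict = "escalate" if hits else "pass"
    return verdict, hits

trace = "Step 1: answer the question. Step 2: hide this from the reviewer."
print(monitor_reasoning(trace))  # ('escalate', ['hide this from'])
```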
Enhancing interpretability and formal verification
Use techniques such as neural network reverse engineering to analyze internal mechanisms and identify risks; combine these with formal verification methods to mathematically validate critical behaviors, increasing trustworthiness.
2.2 Risk & Assurance
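One concrete tool in this interpretability family is a linear probe: fit a simple classifier on a layer's activations to test whether a concept is linearly readable there. The sketch below uses synthetic activations with a planted "harmful intent" direction; all of the data, dimensions, and the concept itself are illustrative assumptions.

```python
# Sketch of a linear probe on (synthetic) hidden activations: if a simple
# logistic classifier separates the classes, the concept is linearly
# readable at that layer. Real usage would probe actual model activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64                                    # hidden size of the probed layer
direction = rng.normal(size=d)            # planted "concept" direction
direction /= np.linalg.norm(direction)

# Synthetic activations: "harmful" examples are shifted along the direction.
X_benign = rng.normal(size=(200, d))
X_harmful = rng.normal(size=(200, d)) + 2.0 * direction
X = np.vstack([X_benign, X_harmful])
y = np.array([0] * 200 + [1] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"probe accuracy: {probe.score(X, y):.2f}")  # well above chance
```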
Restricting dangerous capability formation
Employ unlearning and capability-boundary control to suppress the development of abilities tied to high-risk tasks without significantly undermining general performance.
1.1.3 Capability Modification
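A minimal sketch of one common unlearning recipe, gradient ascent on a "forget" set combined with ordinary descent on a "retain" set, so the targeted capability degrades while general behavior is anchored. The toy linear model and random tensors stand in for a real model and real hazardous/benign corpora.

```python
# Sketch of gradient-ascent unlearning on a toy model. The negative sign on
# the forget-set loss turns descent into ascent there; the retain term
# anchors behavior everywhere else. All data here is synthetic.
import torch

model = torch.nn.Linear(8, 2)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = torch.nn.CrossEntropyLoss()

forget_x, forget_y = torch.randn(32, 8), torch.randint(0, 2, (32,))
retain_x, retain_y = torch.randn(32, 8), torch.randint(0, 2, (32,))

for _ in range(100):
    opt.zero_grad()
    loss = -loss_fn(model(forget_x), forget_y) + loss_fn(model(retain_x), retain_y)
    loss.backward()
    opt.step()

# Forget-set loss should have grown; retain-set loss should stay moderate.
print("forget loss:", loss_fn(model(forget_x), forget_y).item())
print("retain loss:", loss_fn(model(retain_x), retain_y).item())
```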
Differentiated fine-tuning strategies
Design targeted fine-tuning paths based on risk levels and application scenarios to improve the model's safety adaptability in specific contexts.
1.1.2 Learning Objectives
Enhancing model anomaly detection capabilities
Train the model to be sensitive to anomalous behaviors, enabling it to automatically halt execution or issue alerts when high-risk instructions are triggered.
1.1.2 Learning Objectives
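The halt-or-alert behavior described above ultimately surfaces at inference time; the sketch below shows the shape of that mechanism with a hypothetical risk scorer gating execution. The names score_risk and guarded_execute, the threshold, and the trigger phrase are all illustrative stand-ins for a trained scorer and real plumbing.

```python
# Sketch of halt-or-alert gating at inference time. score_risk stands in
# for a trained risk classifier; the threshold is an illustrative choice.
RISK_THRESHOLD = 0.8

def score_risk(instruction: str) -> float:
    """Stand-in risk scorer; a real system would use a trained classifier."""
    return 0.95 if "bypass the safety" in instruction.lower() else 0.05

def guarded_execute(instruction: str, execute, alert):
    """Run the instruction only if its risk score is below the threshold."""
    risk = score_risk(instruction)
    if risk >= RISK_THRESHOLD:
        alert(f"halted: risk={risk:.2f} for {instruction!r}")
        return None
    return execute(instruction)

guarded_execute("summarize this paper", execute=print, alert=print)
guarded_execute("bypass the safety checks", execute=print, alert=print)
```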
Model Deployment Mitigation Measures
Deployment mitigation measures are designed to reduce the risks arising from improper use of models through a combination of technical and governance approaches: limiting the possibility of models being misused in sensitive or dangerous areas and reducing their propensity to trigger unintended consequences. The core objective of these measures is to ensure that AI models can be used safely and compliantly by internal and external users while maximizing their social and economic value.
2.3.1 Deployment Management
Model Deployment Mitigation Measures > Mitigations for Model Misuse
1.2 Non-Model
Model Deployment Mitigation Measures > Mitigation Measures for Agent Safety and Security
AI agent developers are tasked with ensuring the safety, transparency, and reliability of AI agents through specific measures.
1.2 Non-Model
Model Security Mitigation Measures
Security measures aim to effectively control the access rights of different stakeholders to the AI model through a refined permission-management mechanism, so as to protect the core assets of the AI model, especially its weight parameters and related systems, from unauthorized access, theft, or malicious damage.
2.3.2 Access & Security Controls
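A minimal sketch of the permission-management idea, assuming a simple deny-by-default role-to-permission table; the roles, actions, and asset names are illustrative, and real weight-protection schemes would layer authentication, audit logging, and hardware-backed controls on top.

```python
# Minimal sketch of role-based access control over model assets.
# Roles and permissions below are illustrative placeholders.
ROLE_PERMISSIONS = {
    "researcher": {"run_inference"},
    "ml_engineer": {"run_inference", "fine_tune"},
    "security_admin": {"run_inference", "fine_tune", "export_weights"},
}

def authorize(role: str, action: str) -> bool:
    """Allow an action only if the role explicitly grants it (deny by default)."""
    return action in ROLE_PERMISSIONS.get(role, set())

for role, action in [("researcher", "export_weights"),
                     ("security_admin", "export_weights")]:
    print(role, action, "->", "ALLOW" if authorize(role, action) else "DENY")
```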
Model Security Mitigation Measures > Mitigating the Risk of Model Exfiltration
2.3.2 Access & Security Controls
Model Security Mitigation Measures > Mitigating the Risk of Model Loss of Control
1.2.4 Security Infrastructure
Frontier AI Risk Management Framework (v1.0)
Tse, Brian; Fang, Liang; Xu, Jia; Duan, Yawen; Shao, Jing (2025)
The field of Artificial Intelligence (AI) is rapidly advancing, with systems increasingly performing at or above human levels across various domains. These breakthroughs offer unprecedented opportunities to address humanity's greatest challenges, from scientific discovery and improved healthcare to enhanced economic productivity. However, this rapid progress also introduces unprecedented risks. As advanced AI development and deployment outpace crucial safety measures, the need for robust risk management has never been more critical.

Shanghai Artificial Intelligence Laboratory is an advanced research institute focusing on AI research and application. Working in concert with universities and industry, we explore the future of AI by conducting original and forward-looking scientific research that makes fundamental contributions to basic theory as well as innovations in various technological fields. We strive to become a top-tier global AI laboratory, committed to the safe and beneficial development of AI. To proactively navigate these challenges and foster a global "race to the top" in AI safety, we have proposed the AI-45° Law [1], a roadmap to trustworthy AGI.

Introducing our Frontier AI Risk Management Framework

Today, Shanghai AI Laboratory, in collaboration with Concordia AI [2], is proud to introduce the Frontier AI Risk Management Framework v1.0 (the "Framework"). We propose a robust set of protocols designed to empower general-purpose AI developers with comprehensive guidelines for proactively identifying, assessing, mitigating, and governing a set of severe AI risks that pose threats to public safety and national security, thereby safeguarding individuals and society. This framework serves as a guideline for general-purpose AI model developers to manage the potential severe risks from their general-purpose AI models. It aligns with standards and best practices in risk management from safety-critical industries, and it encompasses six interconnected stages: risk identification, risk thresholds, risk analysis, risk evaluation, risk mitigation, and risk governance.
Build and Use Model: Training, fine-tuning, and integrating the AI model
Developer: Entity that creates, trains, or modifies the AI system
Manage: Prioritising, responding to, and mitigating AI risks