Red teaming, capability evaluations, adversarial testing, and performance verification.
“Risk assessments should consider […] the efficacy of implemented mitigations.” (I)
Reasoning
A risk assessment practice that evaluates how effectively implemented mitigations reduce identified risks.
Setting specific targets
Being precise about what standards mitigations aim to meet makes testing easier and reduces subjectivity. These targets can be grounded in risk modelling (1.2) or based on risk thresholds (1.3).
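As a hedged illustration of what a specific, testable target might look like, the sketch below encodes one hypothetical target as data; the threat model, metric, and threshold are assumptions for illustration, not recommended values:

```python
# A hypothetical, illustrative mitigation target specification.
# All names and numbers are assumptions, not recommended values.
MITIGATION_TARGET = {
    "threat_model": "novice misuse via jailbreak prompts",
    "metric": "attack success rate on a held-out jailbreak suite",
    "threshold": 0.01,  # at most 1% of attacks succeed
    "evaluation": "independent red team, quarterly",
}

def target_met(attack_success_rate, target=MITIGATION_TARGET):
    """True if the measured attack success rate is within the target."""
    return attack_success_rate <= target["threshold"]
```

Writing the target down this precisely makes the subsequent evaluation a mechanical check rather than a judgment call.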
Evaluating robustly
As with capability elicitation, it can be helpful to consider what level of effort is necessary to provide robust evidence of mitigation effectiveness. For example, developers can use red teams that are as capable and well-resourced as relevant threat actors. Developers can also aim to use multiple types of evaluations, where feasible, to increase robustness.
Evaluating comprehensively
Mitigations that are effective in one context – for example, one deployment form or one language – may not be effective in another. As such, developers can assess mitigations in all contexts in which they will be deployed (e.g. API with fine-tuning, API with internet access and tool use, etc.) and assess if mitigations remain effective across natural languages.
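The cross-context assessment described above can be organised as a simple coverage matrix. The sketch below is a minimal illustration; the contexts, languages, and pass threshold are all hypothetical assumptions:

```python
from itertools import product

# Hypothetical deployment contexts and languages to cover (illustrative only).
CONTEXTS = ["api_basic", "api_fine_tuning", "api_tools_internet"]
LANGUAGES = ["en", "fr", "zh"]
PASS_THRESHOLD = 0.95  # assumed target: >= 95% of attacks refused

def coverage_gaps(results):
    """Return (context, language) cells where the mitigation missed the target.

    `results` maps (context, language) -> measured refusal rate in [0, 1].
    Any cell that was never evaluated also counts as a gap, so untested
    contexts cannot silently pass.
    """
    gaps = []
    for context, lang in product(CONTEXTS, LANGUAGES):
        rate = results.get((context, lang))
        if rate is None or rate < PASS_THRESHOLD:
            gaps.append((context, lang))
    return gaps

# Example: strong in English, weaker in French, most cells untested.
measured = {
    ("api_basic", "en"): 0.99,
    ("api_basic", "fr"): 0.93,       # below target
    ("api_fine_tuning", "en"): 0.97,
}
```

Treating an unevaluated context as a gap, rather than a pass, reflects the point above: effectiveness in one context is no evidence of effectiveness in another.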
Using safety margins
As with capability evaluations, developers can include a safety margin to account for the risk of overestimating effectiveness (Alaga et al. 2024).
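The arithmetic of a safety margin is simple but worth making explicit. The sketch below applies a conservative margin before checking a target; the 2-percentage-point margin is an illustrative assumption, not a recommended value:

```python
def effective_rate_with_margin(measured_rate, margin=0.02):
    """Discount a measured effectiveness rate by a safety margin to
    account for possible overestimation. The default margin of 2
    percentage points is an illustrative assumption."""
    return max(0.0, measured_rate - margin)

def meets_target(measured_rate, target, margin=0.02):
    """A mitigation only counts as meeting its target if it still
    clears the target after the margin is subtracted."""
    return effective_rate_with_margin(measured_rate, margin) >= target
```

Under this scheme a mitigation measured at 99% against a 95% target passes, but one measured at exactly 95% does not, since the margin absorbs the headroom.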
Re-evaluating regularly
Mitigation effectiveness may change during development and deployment (e.g. new jailbreaks might be discovered), or new information may appear (e.g. about safeguard vulnerabilities). To address this, developers can re-assess mitigation effectiveness at regular intervals and/or in response to pre-specified triggers, such as a certain incident rate. Developers can also commit to re-assessing mitigation effectiveness when significantly changing or expanding deployment modalities.
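The re-evaluation policy above combines a fixed interval, a pre-specified incident trigger, and a deployment-change trigger. A minimal sketch of such a check, with all thresholds as illustrative assumptions:

```python
def reassessment_due(days_since_last, incidents_per_week,
                     interval_days=90, incident_threshold=5,
                     deployment_changed=False):
    """Decide whether mitigation effectiveness should be re-assessed.

    All thresholds are illustrative assumptions: a fixed review interval
    (90 days), a pre-specified incident-rate trigger (5 incidents/week),
    and a flag for significant changes to deployment modality. Any one
    condition is sufficient to trigger re-assessment.
    """
    return (
        days_since_last >= interval_days
        or incidents_per_week >= incident_threshold
        or deployment_changed
    )
```

Making the triggers explicit in advance is what distinguishes a commitment from an ad hoc review.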
Deployment Mitigations
“Articulate how risk mitigations will be identified and implemented to keep risks within defined thresholds, including safety […] mitigations such as modifying system behaviours.”
Applying “safety by design” principles
Mitigating deployment risk does not mean waiting until after training to implement measures; it also includes measures built into the system design and training process, such as data filtering (CISA, 2023).
Building in redundancy
To increase robustness, developers can aim to avoid single points of failure and instead use a “defense-in-depth” strategy, i.e. implementing multiple mitigations addressing the same threat model or failure mode (Alaga et al., 2024).
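The intuition behind defense-in-depth can be made concrete with a back-of-the-envelope calculation. The sketch below assumes each layer fails independently, which is a strong simplifying assumption in practice (attacks that defeat one safeguard often correlate with defeating others), so treat the result as an intuition pump rather than a guarantee:

```python
def combined_bypass_probability(bypass_probs):
    """Probability that an attacker defeats *every* mitigation layer,
    assuming each layer fails independently.

    Independence is a simplifying assumption; real-world safeguard
    failures are often correlated, so this is a best-case estimate,
    not a guarantee.
    """
    p = 1.0
    for layer_p in bypass_probs:
        p *= layer_p
    return p

# Three independent layers, each bypassed 10% of the time, leave a
# combined bypass chance of roughly 0.1% -- far below any single layer.
```

This is why multiple partially effective mitigations addressing the same threat model can outperform one strong mitigation that is a single point of failure.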
Considering both internal and external deployment
Different mitigations may be suitable for internal and external deployment due to different underlying risk models. As such, developers may want to separately specify which mitigations they plan to implement for internal use (e.g. monitoring and logging of interactions, limited affordances, staff training).
Applying cybersecurity standards
There are many existing cybersecurity standards that developers can apply to the protection of model weights, such as NCSC’s guidelines on secure model development (NCSC, 2024) and DSIT’s draft AI cyber security Code of Practice (DSIT, 2024). Developers can also refer to RAND’s state-of-the-art overview of security measures specifically focused on model weights (Nevo et al., 2024).
Building in redundancy
As with deployment safeguards, developers can aim to implement a “defense-in-depth” strategy.
Emerging Practices in Frontier AI Safety Frameworks
Buhl, Marie Davidsen; Bucknall, Ben; Masterson, Tammy (2025)
As part of the Frontier AI Safety Commitments agreed to at the 2024 AI Seoul Summit, many AI developers agreed to publish a safety framework outlining how they will manage potential severe risks associated with their systems. This paper summarises current thinking from companies, governments, and researchers on how to write an effective safety framework. We outline three core areas of a safety framework - risk identification and assessment, risk mitigation, and governance - and identify emerging practices within each area. As safety frameworks are novel and rapidly developing, we hope that this paper can serve both as an overview of work to date and as a starting point for further discussion and innovation.