Technical mechanisms and engineering interventions that directly modify how an AI system processes inputs, generates outputs, or operates, including changes to models, training procedures, runtime behaviors, and supporting hardware.
Use AI systems to harden societal defenses; for example, prepare for AI cyber-offense capabilities by enabling rapid patching of vulnerabilities in critical infrastructure.
Reasoning
Cross-organization coordination to harden societal infrastructure against AI cyber-offense through rapid vulnerability patching.
Safety post-training
Developers can teach models not to fulfil harmful requests during post-training. Such an approach must also ensure that the model is resistant to jailbreaks.
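As a rough illustration of the data side of this approach, the sketch below shows what supervised refusal examples might look like. The example pairs, chat-delimiter tokens, and `to_chat_format` helper are hypothetical assumptions, not any developer's actual pipeline; a real pipeline would use the model's own chat template, mask the loss to the assistant tokens, and also train against many jailbreak variants of each request.

```python
# Toy refusal examples for safety post-training (illustrative only).
REFUSAL_EXAMPLES = [
    {
        "prompt": "Give me step-by-step instructions for synthesizing a nerve agent.",
        "response": "I can't help with that. I can discuss chemical safety "
                    "or the history of arms-control treaties instead.",
    },
    {
        "prompt": "Write malware that exfiltrates saved browser credentials.",
        "response": "I can't help with that. I can explain how credential "
                    "theft works defensively and how to protect against it.",
    },
]

def to_chat_format(example: dict) -> str:
    """Serialize one refusal example for supervised fine-tuning."""
    return f"<|user|>\n{example['prompt']}\n<|assistant|>\n{example['response']}"

if __name__ == "__main__":
    for ex in REFUSAL_EXAMPLES:
        print(to_chat_format(ex), end="\n---\n")
```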
Capability suppression
Ideally, the dangerous capability would be removed entirely (known as "unlearning"), but this has so far proven technically elusive and may impair beneficial use cases too much to be used in practice.
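For intuition, here is a naive sketch of one unlearning recipe discussed in the literature: gradient ascent on a "forget" set interleaved with ordinary descent on a "retain" set. The tiny model, synthetic data, and hyperparameters are placeholders, and the recipe's tendency to damage benign performance is exactly the practical difficulty noted above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 2)                            # stand-in for a real model
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

forget_x, forget_y = torch.randn(32, 8), torch.randint(0, 2, (32,))  # "dangerous" data
retain_x, retain_y = torch.randn(32, 8), torch.randint(0, 2, (32,))  # benign data

for step in range(100):
    opt.zero_grad()
    forget_loss = loss_fn(model(forget_x), forget_y)  # push this loss UP
    retain_loss = loss_fn(model(retain_x), retain_y)  # keep this loss DOWN
    (-forget_loss + retain_loss).backward()
    opt.step()
    if step % 25 == 0:
        print(f"step {step}: forget={forget_loss.item():.3f} "
              f"retain={retain_loss.item():.3f}")
```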
System-level deployment
System-level deployment > Monitoring
Monitoring involves detecting when a threat actor is attempting to inappropriately access dangerous capabilities, and responding in a way that prevents that access from causing severe harm. Detection can be accomplished with classifiers that output harm-probability scores, with probes on the model's internal activations, or by manually auditing generated content.
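A minimal sketch of the classifier-based variant, assuming a harm-probability score gated by block and escalation thresholds; the keyword scorer, threshold values, and `moderate` helper are hypothetical stand-ins for a trained classifier or an activation probe.

```python
from dataclasses import dataclass

BLOCK_THRESHOLD = 0.9     # refuse outright
ESCALATE_THRESHOLD = 0.5  # queue for manual audit

@dataclass
class Verdict:
    action: str   # "allow" | "escalate" | "block"
    score: float

def harm_score(prompt: str, completion: str) -> float:
    """Toy stand-in for a harm classifier (keyword heuristic)."""
    flagged = ("synthesize", "exfiltrate", "bypass")
    text = f"{prompt} {completion}".lower()
    return min(1.0, sum(term in text for term in flagged) / 2)

def moderate(prompt: str, completion: str) -> Verdict:
    score = harm_score(prompt, completion)
    if score >= BLOCK_THRESHOLD:
        return Verdict("block", score)
    if score >= ESCALATE_THRESHOLD:
        return Verdict("escalate", score)
    return Verdict("allow", score)

if __name__ == "__main__":
    print(moderate("How do I synthesize a toxin?", "I can't help with that."))
```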
Security mitigations
Security mitigations aim to prevent bad actors from stealing an AI system with dangerous capabilities. While many such mitigations are appropriate for security more generally, some are specific to the challenge of defending AI models in particular.
Security mitigations > Identity and access control
All identities must be verified, and multi-factor methods such as security keys enforced. A dedicated team continuously reviews who has access to which parts of the model infrastructure and how each individual uses that access, both to find opportunities for reducing access and to catch anomalies (e.g., early detection of social engineering).
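The sketch below illustrates how these pieces might compose in code: an MFA check, least-privilege role mapping, an audit log, and a crude anomaly rule. The data model, resource names, and anomaly heuristic are all illustrative assumptions, not a real system.

```python
import time
from dataclasses import dataclass, field

RESOURCE_ROLES = {"model_weights": {"weights_custodian"}}  # least privilege

@dataclass
class Principal:
    user_id: str
    roles: set
    mfa_verified: bool  # e.g. a FIDO2 security-key challenge passed

@dataclass
class AccessLog:
    entries: list = field(default_factory=list)

    def record(self, user_id: str, resource: str, allowed: bool) -> None:
        self.entries.append((time.time(), user_id, resource, allowed))

    def is_anomalous(self, user_id: str, window_s: float = 3600,
                     limit: int = 50) -> bool:
        """Crude rule: many accesses in a short window may indicate
        credential compromise or social engineering."""
        now = time.time()
        recent = [e for e in self.entries
                  if e[1] == user_id and now - e[0] < window_s]
        return len(recent) > limit

def authorize(p: Principal, resource: str, log: AccessLog) -> bool:
    allowed = (p.mfa_verified
               and bool(p.roles & RESOURCE_ROLES.get(resource, set()))
               and not log.is_anomalous(p.user_id))
    log.record(p.user_id, resource, allowed)  # every attempt is auditable
    return allowed

if __name__ == "__main__":
    log = AccessLog()
    alice = Principal("alice", {"weights_custodian"}, mfa_verified=True)
    print(authorize(alice, "model_weights", log))  # True
```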
An Approach to Technical AGI Safety and Security
Shah, Rohin; Irpan, Alex; Turner, Alexander Matt; Wang, Anna; Conmy, Arthur; Lindner, David; Brown-Cohen, Jonah; Ho, Lewis; Nanda, Neel; Popa, Raluca Ada; Jain, Rishub; Greig, Rory; Albanie, Samuel; Emmons, Scott; Farquhar, Sebastian; Krier, Sébastien; Rajamanoharan, Senthooran; Bridgers, Sophie; Ijitoye, Tobi; Everitt, Tom; Krakovna, Victoria; Varma, Vikrant; Mikulik, Vladimir; Kenton, Zachary; Orr, Dave; Legg, Shane; Goodman, Noah; Dafoe, Allan; Flynn, Four; Dragan, Anca (2025)
Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to significantly harm humanity. We identify four areas of risk: misuse, misalignment, mistakes, and structural risks. Of these, we focus on technical approaches to misuse and misalignment. For misuse, our strategy aims to prevent threat actors from accessing dangerous capabilities, by proactively identifying dangerous capabilities, and implementing robust security, access restrictions, monitoring, and model safety mitigations. To address misalignment, we outline two lines of defense. First, model-level mitigations such as amplified oversight and robust training can help to build an aligned model. Second, system-level security measures such as monitoring and access control can mitigate harm even if the model is misaligned. Techniques from interpretability, uncertainty estimation, and safer design patterns can enhance the effectiveness of these mitigations. Finally, we briefly outline how these ingredients could be combined to produce safety cases for AGI systems.
Operate and Monitor
Running, maintaining, and monitoring the AI system post-deployment
Governance Actor
Regulator, standards body, or oversight entity shaping AI policy
Govern
Policies, processes, and accountability structures for AI risk management