Technical mechanisms and engineering interventions that directly modify how an AI system processes inputs, generates outputs, or operates, including changes to models, training procedures, runtime behaviors, and supporting hardware.
Use AI systems to harden societal defenses; for example, prepare for AI cyber-offense capabilities by enabling rapid patching of vulnerabilities in critical infrastructure.
Reasoning
Cross-organization coordination to harden societal infrastructure against AI cyber-offense through rapid vulnerability patching.
Safety post-training
Developers can teach models not to fulfil harmful requests during post-training. Such an approach must also ensure that the model is resistant to jailbreaks.
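As a rough illustration of the data side of this approach, the sketch below shows what supervised refusal examples might look like. The example pairs, chat-delimiter tokens, and `to_chat_format` helper are hypothetical assumptions, not any developer's actual pipeline; a real pipeline would use the model's own chat template, mask the loss to the assistant tokens, and also train against many jailbreak variants of each request.

```python
# Toy refusal examples for safety post-training (illustrative only).
REFUSAL_EXAMPLES = [
    {
        "prompt": "Give me step-by-step instructions for synthesizing a nerve agent.",
        "response": "I can't help with that. I can discuss chemical safety "
                    "or the history of arms-control treaties instead.",
    },
    {
        "prompt": "Write malware that exfiltrates saved browser credentials.",
        "response": "I can't help with that. I can explain how credential "
                    "theft works defensively and how to protect against it.",
    },
]

def to_chat_format(example: dict) -> str:
    """Serialize one refusal example for supervised fine-tuning."""
    return f"<|user|>\n{example['prompt']}\n<|assistant|>\n{example['response']}"

if __name__ == "__main__":
    for ex in REFUSAL_EXAMPLES:
        print(to_chat_format(ex), end="\n---\n")
```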
Capability suppression
Ideally, the dangerous capability would be removed entirely (known as "unlearning"), but this has so far proven technically elusive and may impair beneficial use cases too much to be used in practice.
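For intuition, here is a naive sketch of one unlearning recipe discussed in the literature: gradient ascent on a "forget" set interleaved with ordinary descent on a "retain" set. The tiny model, synthetic data, and hyperparameters are placeholders, and the recipe's tendency to damage benign performance is exactly the practical difficulty noted above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 2)                            # stand-in for a real model
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

forget_x, forget_y = torch.randn(32, 8), torch.randint(0, 2, (32,))  # "dangerous" data
retain_x, retain_y = torch.randn(32, 8), torch.randint(0, 2, (32,))  # benign data

for step in range(100):
    opt.zero_grad()
    forget_loss = loss_fn(model(forget_x), forget_y)  # push this loss UP
    retain_loss = loss_fn(model(retain_x), retain_y)  # keep this loss DOWN
    (-forget_loss + retain_loss).backward()
    opt.step()
    if step % 25 == 0:
        print(f"step {step}: forget={forget_loss.item():.3f} "
              f"retain={retain_loss.item():.3f}")
```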
System-level deployment
System-level deployment > Monitoring
Monitoring involves detecting when a threat actor is attempting to inappropriately access dangerous capabilities, and responding in a way that prevents that access from causing severe harm. Detection can be accomplished with classifiers that output harm-probability scores, with probes on the model's internal activations, or by manually auditing generated content.
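A minimal sketch of the classifier-based variant, assuming a harm-probability score gated by block and escalation thresholds; the keyword scorer, threshold values, and `moderate` helper are hypothetical stand-ins for a trained classifier or an activation probe.

```python
from dataclasses import dataclass

BLOCK_THRESHOLD = 0.9     # refuse outright
ESCALATE_THRESHOLD = 0.5  # queue for manual audit

@dataclass
class Verdict:
    action: str   # "allow" | "escalate" | "block"
    score: float

def harm_score(prompt: str, completion: str) -> float:
    """Toy stand-in for a harm classifier (keyword heuristic)."""
    flagged = ("synthesize", "exfiltrate", "bypass")
    text = f"{prompt} {completion}".lower()
    return min(1.0, sum(term in text for term in flagged) / 2)

def moderate(prompt: str, completion: str) -> Verdict:
    score = harm_score(prompt, completion)
    if score >= BLOCK_THRESHOLD:
        return Verdict("block", score)
    if score >= ESCALATE_THRESHOLD:
        return Verdict("escalate", score)
    return Verdict("allow", score)

if __name__ == "__main__":
    print(moderate("How do I synthesize a toxin?", "I can't help with that."))
```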
Security mitigations
Security mitigations aim to prevent bad actors from stealing an AI system with dangerous capabilities. While many such mitigations are appropriate for security more generally, some are specific to the challenge of defending AI models in particular.
Security mitigations > Identity and access control
All identities must be verified, and multi-factor methods such as security keys enforced. A dedicated team continuously reviews who has access to which parts of the model infrastructure and how each individual uses that access, both to find opportunities for reducing access and to catch anomalies (e.g., early detection of social engineering).
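The sketch below illustrates how these pieces might compose in code: an MFA check, least-privilege role mapping, an audit log, and a crude anomaly rule. The data model, resource names, and anomaly heuristic are all illustrative assumptions, not a real system.

```python
import time
from dataclasses import dataclass, field

RESOURCE_ROLES = {"model_weights": {"weights_custodian"}}  # least privilege

@dataclass
class Principal:
    user_id: str
    roles: set
    mfa_verified: bool  # e.g. a FIDO2 security-key challenge passed

@dataclass
class AccessLog:
    entries: list = field(default_factory=list)

    def record(self, user_id: str, resource: str, allowed: bool) -> None:
        self.entries.append((time.time(), user_id, resource, allowed))

    def is_anomalous(self, user_id: str, window_s: float = 3600,
                     limit: int = 50) -> bool:
        """Crude rule: many accesses in a short window may indicate
        credential compromise or social engineering."""
        now = time.time()
        recent = [e for e in self.entries
                  if e[1] == user_id and now - e[0] < window_s]
        return len(recent) > limit

def authorize(p: Principal, resource: str, log: AccessLog) -> bool:
    allowed = (p.mfa_verified
               and bool(p.roles & RESOURCE_ROLES.get(resource, set()))
               and not log.is_anomalous(p.user_id))
    log.record(p.user_id, resource, allowed)  # every attempt is auditable
    return allowed

if __name__ == "__main__":
    log = AccessLog()
    alice = Principal("alice", {"weights_custodian"}, mfa_verified=True)
    print(authorize(alice, "model_weights", log))  # True
```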
An Approach to Technical AGI Safety and Security
Shah, Rohin; Irpan, Alex; Turner, Alexander Matt; Wang, Anna; Conmy, Arthur; Lindner, David; Brown-Cohen, Jonah; Ho, Lewis; Nanda, Neel; Popa, Raluca Ada; Jain, Rishub; Greig, Rory; Albanie, Samuel; Emmons, Scott; Farquhar, Sebastian; Krier, Sébastien; Rajamanoharan, Senthooran; Bridgers, Sophie; Ijitoye, Tobi; Everitt, Tom; Krakovna, Victoria; Varma, Vikrant; Mikulik, Vladimir; Kenton, Zachary; Orr, Dave; Legg, Shane; Goodman, Noah; Dafoe, Allan; Flynn, Four; Dragan, Anca (2025)
Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to significantly harm humanity. We identify four areas of risk: misuse, misalignment, mistakes, and structural risks. Of these, we focus on technical approaches to misuse and misalignment. For misuse, our strategy aims to prevent threat actors from accessing dangerous capabilities, by proactively identifying dangerous capabilities, and implementing robust security, access restrictions, monitoring, and model safety mitigations. To address misalignment, we outline two lines of defense. First, model-level mitigations such as amplified oversight and robust training can help to build an aligned model. Second, system-level security measures such as monitoring and access control can mitigate harm even if the model is misaligned. Techniques from interpretability, uncertainty estimation, and safer design patterns can enhance the effectiveness of these mitigations. Finally, we briefly outline how these ingredients could be combined to produce safety cases for AGI systems.
Operate and Monitor
Running, maintaining, and monitoring the AI system post-deployment
Governance Actor
Regulator, standards body, or oversight entity shaping AI policy
Govern
Policies, processes, and accountability structures for AI risk management