This page is still being polished. If you have thoughts, please share them via the feedback form.
Data on this page is preliminary and may change. Please do not share or cite these figures publicly.
Technical mechanisms operating on non-model components of the AI system without modifying model weights. Components include: input/output interfaces, runtime environment, guardrail/monitoring classifiers, tool chain, and hardware.
Also in AI System
Reasoning
Deployment suggests phased/staged system rollout management. Insufficient specifics limit confidence.
Safety post-training
Developers can teach models not to fulfil harmful requests during posttraining. Such an approach would need to additionally ensure that the model is resistant to jailbreaks.
1.1.2 Learning ObjectivesCapability suppression
It would be ideal to remove the dangerous capability entirely (aka “unlearning”), though this has so far remained elusive technically, and may impair beneficial use cases too much to be used in practice.
1.1.3 Capability ModificationSecurity mitigations
aim to prevent bad actors from stealing an AI system with dangerous capabilities. While many such mitigations are appropriate for security more generally, there are novel mitigations more specific to the challenge of defending AI models in particular.
1.2.4 Security InfrastructureSecurity mitigations > Identity and access control
. All identity must be verified, and multi-factor methods like security key enforced. A dedicated team is continuously supervising who has access to what part of the model infrastructure and how each individual uses the access to observe opportunities for reducing access and anomalies (e.g. early capture of social engineering).
1.2.4 Security InfrastructureSecurity mitigations > Environment hardening
The environment exposed to the model should be hardened from pre-training all the way to post-training, and inference time, and enhanced with monitoring and logging, exfiltration detection and resistance, detection and remediation of improper storage.
1.2 Non-ModelSecurity mitigations > Encrypted processing
Encryption in use, namely keeping the model and its weights encrypted at all time, promises to protect the model weights against attackers who managed to break into the servers.
1.2.4 Security InfrastructureAn Approach to Technical AGI Safety and Security
Shah, Rohin; Irpan, Alex; Turner, Alexander Matt; Wang, Anna; Conmy, Arthur; Lindner, David; Brown-Cohen, Jonah; Ho, Lewis; Nanda, Neel; Popa, Raluca Ada; Jain, Rishub; Greig, Rory; Albanie, Samuel; Emmons, Scott; Farquhar, Sebastian; Krier, Sébastien; Rajamanoharan, Senthooran; Bridgers, Sophie; Ijitoye, Tobi; Everitt, Tom; Krakovna, Victoria; Varma, Vikrant; Mikulik, Vladimir; Kenton, Zachary; Orr, Dave; Legg, Shane; Goodman, Noah; Dafoe, Allan; Flynn, Four; Dragan, Anca (2025)
Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to significantly harm humanity. We identify four areas of risk: misuse, misalignment, mistakes, and structural risks. Of these, we focus on technical approaches to misuse and misalignment. For misuse, our strategy aims to prevent threat actors from accessing dangerous capabilities, by proactively identifying dangerous capabilities, and implementing robust security, access restrictions, monitoring, and model safety mitigations. To address misalignment, we outline two lines of defense. First, model-level mitigations such as amplified oversight and robust training can help to build an aligned model. Second, system-level security measures such as monitoring and access control can mitigate harm even if the model is misaligned. Techniques from interpretability, uncertainty estimation, and safer design patterns can enhance the effectiveness of these mitigations. Finally, we briefly outline how these ingredients could be combined to produce safety cases for AGI systems.
Deploy
Releasing the AI system into a production environment
Deployer
Entity that integrates and deploys the AI system for end users
Manage
Prioritising, responding to, and mitigating AI risks