BackSystem level deployment

This page is still being polished. If you have thoughts, please share them via the feedback form.

Data on this page is preliminary and may change. Please do not share or cite these figures publicly.

System level deployment

Shah (2025)|LLM classified

Mitigation Taxonomy

1AI System

1.2Non-Model

Technical mechanisms operating on non-model components of the AI system without modifying model weights. Components include: input/output interfaces, runtime environment, guardrail/monitoring classifiers, tool chain, and hardware.

Also in AI System

1.1 Model

LLM Classification Details

Reasoning

Deployment suggests phased/staged system rollout management. Insufficient specifics limit confidence.

Code: 2.3.1Version: v0.5Classified: Jan 22, 2026

Sub-mitigations (1)

Monitoring

Monitoring involves detecting when a threat actor is attempting to inappropriately access dangerous capabilities, and responding in a way that prevents them from using any access to cause severe harm. Detection can be accomplished by using classifiers that output harm probability scores, leveraging the internal activations of the model, or auditing generated content manually.

1.2.3 Monitoring & Detection

Lifecycle:DeployActor:DeployerAIRM:Manage

Other mitigations from Shah (2025) (8)

Safety post-training

Developers can teach models not to fulfil harmful requests during posttraining. Such an approach would need to additionally ensure that the model is resistant to jailbreaks.

1.1.2 Learning Objectives

Lifecycle:DeployActor:DeployerAIRM:Manage

Capability suppression

It would be ideal to remove the dangerous capability entirely (aka “unlearning”), though this has so far remained elusive technically, and may impair beneficial use cases too much to be used in practice.

1.1.3 Capability Modification

Lifecycle:DeployActor:DeployerAIRM:Manage

Security mitigations

aim to prevent bad actors from stealing an AI system with dangerous capabilities. While many such mitigations are appropriate for security more generally, there are novel mitigations more specific to the challenge of defending AI models in particular.

1.2.4 Security Infrastructure

Lifecycle:Operate and MonitorActor:Governance ActorAIRM:Govern

Security mitigations > Identity and access control

. All identity must be verified, and multi-factor methods like security key enforced. A dedicated team is continuously supervising who has access to what part of the model infrastructure and how each individual uses the access to observe opportunities for reducing access and anomalies (e.g. early capture of social engineering).

1.2.4 Security Infrastructure

Lifecycle:Operate and MonitorActor:Affected StakeholderAIRM:Govern

Security mitigations > Environment hardening

The environment exposed to the model should be hardened from pre-training all the way to post-training, and inference time, and enhanced with monitoring and logging, exfiltration detection and resistance, detection and remediation of improper storage.

1.2 Non-Model

Lifecycle:Collect and Process DataActor:DeveloperAIRM:Manage

Security mitigations > Encrypted processing

Encryption in use, namely keeping the model and its weights encrypted at all time, promises to protect the model weights against attackers who managed to break into the servers.

1.2.4 Security Infrastructure

Lifecycle:Operate and MonitorActor:Infrastructure ProviderAIRM:Manage

View all 8 mitigations from this source →

Source Document

An Approach to Technical AGI Safety and Security

Shah, Rohin; Irpan, Alex; Turner, Alexander Matt; Wang, Anna; Conmy, Arthur; Lindner, David; Brown-Cohen, Jonah; Ho, Lewis; Nanda, Neel; Popa, Raluca Ada; Jain, Rishub; Greig, Rory; Albanie, Samuel; Emmons, Scott; Farquhar, Sebastian; Krier, Sébastien; Rajamanoharan, Senthooran; Bridgers, Sophie; Ijitoye, Tobi; Everitt, Tom; Krakovna, Victoria; Varma, Vikrant; Mikulik, Vladimir; Kenton, Zachary; Orr, Dave; Legg, Shane; Goodman, Noah; Dafoe, Allan; Flynn, Four; Dragan, Anca (2025)

Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to significantly harm humanity. We identify four areas of risk: misuse, misalignment, mistakes, and structural risks. Of these, we focus on technical approaches to misuse and misalignment. For misuse, our strategy aims to prevent threat actors from accessing dangerous capabilities, by proactively identifying dangerous capabilities, and implementing robust security, access restrictions, monitoring, and model safety mitigations. To address misalignment, we outline two lines of defense. First, model-level mitigations such as amplified oversight and robust training can help to build an aligned model. Second, system-level security measures such as monitoring and access control can mitigate harm even if the model is misaligned. Techniques from interpretability, uncertainty estimation, and safer design patterns can enhance the effectiveness of these mitigations. Finally, we briefly outline how these ingredients could be combined to produce safety cases for AGI systems.

View source DOI: 10.48550/arXiv.2504.01849

Classification

AI Lifecycle Stage

Deploy

Releasing the AI system into a production environment

Responsible Actor

Deployer

Entity that integrates and deploys the AI system for end users

NIST AI RMF Function

Manage

Prioritising, responding to, and mitigating AI risks

Risk Domains

Primary

7.1 AI pursuing its own goals in conflict with human goals or values

Other

7.2 AI possessing dangerous capabilities 2.2 AI system security vulnerabilities and attacks