User vetting, access restrictions, encryption, and infrastructure security for deployed systems.
Access Control Mitigations govern who can use a model, what capabilities they can access, and how the model can interact with external systems. Approaches include user verification protocols, tiered access levels, geographic restrictions, sandboxed execution environments, and restrictions on model permissions. These mitigations can help reduce risk by preventing unauthorized actors from accessing advanced capabilities, limiting which features are available to different user groups, and constraining how models can interact with external systems.
Reasoning
Access control restricts user eligibility and system entry through organizational security practices.
5.1 Access Control Frameworks
While developers typically implement mitigations for general-access deployments, they may complement these with additional access controls when providing versions with modified or reduced mitigations for specialized use cases. For example, a developer might give safety researchers access to a model version with reduced refusal training to enable more comprehensive risk assessments, or give verified medical researchers a model with fewer restrictions on discussing pathogen characteristics for legitimate vaccine development work. (Such provisions may not be necessary if standard safeguards are sufficient for all intended users and use cases.) Where modified versions are offered, developers may implement access restrictions so that only verified and authorized users can interact with them. To determine appropriate access levels, developers might define "acceptable use policies" based on threat modeling and cost-benefit analysis. Such frameworks can then guide the review of users' intended use cases, the validation of user identity and trustworthiness, and the determination of appropriate access given the required modifications to safeguards.
5.2 Staged Deployment
Developers may implement staged rollouts for powerful new models, starting with highly controlled environments and gradually expanding access as controls are validated. Initial deployment could involve small groups of external users or research partners operating under strict monitoring agreements, allowing developers to observe model behavior and identify potential risks before broader release. Subsequent stages could then expand systematically: for example, deploying first to verified commercial customers with specific use cases, then to researchers and academics with appropriate credentials, and eventually to the broader public with standard safeguards. Each stage provides data about usage patterns and potential risks that informs the next phase of deployment.
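The stage-gating logic above can be sketched as a simple state machine; the stage names and the single pass/fail flag are illustrative assumptions, and a real rollout decision would rest on substantive review of the monitoring data from the current stage.

```python
# Hypothetical rollout stages, ordered from most to least restricted.
STAGES = [
    "research_partners",      # small external group under monitoring agreements
    "commercial_verified",    # verified customers with specific use cases
    "academic_credentialed",  # researchers with appropriate credentials
    "general_public",         # broad release with standard safeguards
]

def next_stage(current: str, review_passed: bool) -> str:
    """Advance to the next rollout stage only if the risk review of the
    current stage passed; otherwise hold at the current stage."""
    idx = STAGES.index(current)
    if review_passed and idx + 1 < len(STAGES):
        return STAGES[idx + 1]
    return current
```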
5.3 Model Permissions and Sandboxing
Beyond controlling user access, developers implement restrictions on what models themselves can do within their operating environment. For example, sandboxing can isolate model execution from sensitive systems through contained environments with limited network access, restricted file system permissions, API rate limits, and disabled access to external tools for high-risk operations. Permission systems can further control model capabilities by requiring human approval for code execution, limiting database access based on user authorization, requiring explicit consent before accessing sensitive data, and implementing time limits on autonomous operation.
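The permission checks described above could be sketched as an action gate that requires human approval for high-risk operations and bounds autonomous operation in time. The action names, approval flag, and time budget here are hypothetical, chosen only to illustrate the pattern.

```python
from dataclasses import dataclass

# Hypothetical set of operations treated as high-risk.
HIGH_RISK_ACTIONS = {"execute_code", "write_database", "network_request"}

@dataclass
class ActionRequest:
    action: str
    human_approved: bool
    elapsed_autonomous_seconds: float  # time since last human interaction

def is_permitted(req: ActionRequest, max_autonomous_seconds: float = 300.0) -> bool:
    """Deny any action once the autonomy time limit is exceeded, and
    require explicit human sign-off for high-risk operations."""
    if req.elapsed_autonomous_seconds > max_autonomous_seconds:
        return False
    if req.action in HIGH_RISK_ACTIONS:
        return req.human_approved
    return True
```

In a deployed system this gate would sit between the model and its tools, alongside the sandboxing measures (network isolation, file-system restrictions, rate limits) described above.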
Capability Limitation Mitigations
Capability limitation mitigations aim to prevent models from possessing knowledge or abilities that could enable harm. These methods alter the model's weights or training process, so that it cannot assist with harmful actions when prompted by humans or autonomously pursue harmful objectives. However, the effectiveness of these mitigations is an active area of research, and they can currently be circumvented if dual-use knowledge (knowledge that has both benign and harmful applications) is added in the context window during inference or fine-tuning.
2.1 Data Filtering
Data filtering involves removing content from training datasets that could lead to dual-use or potentially harmful capabilities. Developers can use several methods: automated classifiers to identify and remove content related to weapons development, detailed attack methodologies, or other high-risk domains; keyword-based filters to exclude documents containing specific terminology or instructions of concern; and machine learning models trained to recognize subtle patterns in content that might contribute to dangerous capabilities.
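A toy sketch of such a pipeline, combining a keyword-based stage with a hook for a trained classifier, might look as follows. The patterns and function names are illustrative assumptions; production filters would use far larger rule sets and real classifier models.

```python
import re

# Illustrative blocklist patterns; a real pipeline would maintain
# extensive, reviewed rule sets per risk domain.
BLOCKLIST_PATTERNS = [
    re.compile(r"\bweapon\s+synthesis\b", re.IGNORECASE),
    re.compile(r"\battack\s+methodology\b", re.IGNORECASE),
]

def passes_keyword_filter(document: str) -> bool:
    """True if no blocklisted pattern appears in the document."""
    return not any(p.search(document) for p in BLOCKLIST_PATTERNS)

def filter_corpus(docs, classifier=None):
    """Drop documents flagged by keyword rules or by an optional
    classifier (a callable returning True for high-risk content)."""
    kept = []
    for doc in docs:
        if not passes_keyword_filter(doc):
            continue
        if classifier is not None and classifier(doc):
            continue
        kept.append(doc)
    return kept
```

The classifier hook stands in for the machine-learning stage mentioned above, which targets subtler patterns that keyword rules miss.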
2.2 Exploratory Methods
Beyond data filtering, researchers are investigating additional capability limitation approaches. However, these methods face technical challenges, and their effectiveness remains uncertain.

● Model distillation could create specialized versions of frontier models with capabilities limited to specific domains. For example, a model could excel at medical diagnosis while lacking knowledge needed for biological weapons development. While the capability limitations may be more fundamental than post-hoc safety training, it remains unclear how effectively this approach prevents harmful capabilities from being reconstructed. Additionally, multiple specialized models would be needed to cover various use cases, increasing development and maintenance costs.

● Targeted unlearning attempts to remove specific dangerous capabilities from models after initial training, offering a more precise alternative to full retraining. Possible approaches include fine-tuning on datasets to overwrite specific knowledge while preserving general capabilities, or modifying how models internally structure and access particular information. However, these methods may be reversible with relatively modest effort: restoring "unlearned" capabilities through targeted fine-tuning with small datasets. Models may also regenerate removed knowledge by inferring from adjacent information that remains accessible.

While research continues on these approaches, developers currently rely more heavily on post-deployment mitigations that can be more reliably implemented and assessed.
Frontier Mitigations
Frontier Model Forum (2025)
Frontier mitigations are protective measures implemented on frontier models, with the goal of reducing the risk of potential high-severity harms, especially those related to national security and public safety, that could arise from their advanced capabilities. This report discusses emerging industry practices for implementing and assessing frontier mitigations. It focuses on mitigations for managing risks in three primary domains: chemical, biological, radiological and nuclear (CBRN) information threats; advanced cyber threats; and advanced autonomous behavior threats. Given the nascent state of frontier mitigations, this report describes the range of controls and mitigation strategies being employed or researched by Frontier Model Forum members and documents the known limitations of these approaches.
Deploy: Releasing the AI system into a production environment.
Deployer: Entity that integrates and deploys the AI system for end users.
Manage: Prioritising, responding to, and mitigating AI risks.
Primary: 4 Malicious Actors & Misuse