This page is still being polished. If you have thoughts, please share them via the feedback form.
Data on this page is preliminary and may change. Please do not share or cite these figures publicly.
Red teaming, capability evaluations, adversarial testing, and performance verification.
Also in Risk & Assurance
Following the implementation of mitigations, it is important for developers to verify that the mitigation measures effectively reduce risks to acceptable levels under realistic conditions.
Reasoning
Assesses frontier mitigation effectiveness through testing and evaluation activities.
7.1 Testing Individual Controls
Developers may test specific technical properties of individual safeguards, such as precision and recall of classifier models, jailbreak detection rates of detection systems, or effectiveness of behavioral guardrails against prompt variations. Developers may also use generative models to create diverse test cases, helping to identify coverage gaps. Testing individual controls enables more precise identification of component-specific weaknesses, earlier issue detection, and targeted improvements.
2.2.2 Testing & Evaluation7.2 Testing Controls in Combination
Developers may test mitigations in combination to evaluate how safeguards function as an integrated system. This can give a clearer picture of how the combined system will behave under realistic deployment conditions or reveal emergent vulnerabilities from components that function correctly in isolation. Possible approaches include: ● Internal Safety Teams: Teams within the organization conduct structured testing using knowledge of model architecture and training methods. This approach benefits from institutional knowledge but may suffer from blind spots due to organizational biases. ● External Red-Teams: Specialized firms that conduct adversarial testing with domain-specific expertise (e.g., biosecurity, cybersecurity), using established frameworks, and bringing experience from testing multiple models across organizations. ● Researcher Access Programs: Selected academic and non-profit researchers receive privileged access to conduct independent analyses, focusing on specific research questions in emerging areas like interpretability or novel attack vectors, and publish findings after responsible disclosure periods. ● Bug Bounty Programs: Open programs financially incentivize external researchers to discover and report vulnerabilities (e.g., novel jailbreaking techniques) through predetermined reward structures, with clear scope definitions and verification procedures to manage submissions. ● Red Team vs Blue Team Exercises: Developers may run structured adversarial exercises where red teams attempt to exploit vulnerabilities in mitigation systems while blue teams defend and monitor for attacks. These exercises can stress-test monitoring systems against increasingly sophisticated model behaviors, attack strategies, and autonomous misalignment scenarios. ● System Safety Analysis: Although not yet common practice, system safety frameworks like System-Theoretic Process Analysis (STPA) may also be valuable for identifying hazards that emerge from system component interactions rather than from individual component failures. ● Ongoing Deployment Monitoring: Developers may track mitigation effectiveness throughout deployment, as safeguard efficacy can degrade when attackers develop new circumvention methods. Operational metrics might include detection system performance, attack sophistication trends, and intervention success rates, enabling timely updates when protections require reinforcement.
2.2.2 Testing & Evaluation7.3 Modeling Residual Risk
Some developers and third-party researchers have begun exploring structured approaches for translating safeguard evaluations into residual risk estimates, considering the individual controls, their effectiveness in combination, and the surrounding infrastructure security. For example, creating a model of how effectively the mitigation system reduces risks compared to an unmitigated baseline, and using sensitivity analysis to identify which factors most significantly impact residual risk. These models may integrate information including: 1. Data about safeguard robustness – how consistently safety measures perform across different inputs and how much effort is required to circumvent them. 2. Domain expert estimates, collected through surveys and/or structured forecasting exercises, of the distribution of potential adversaries with varying capabilities and motivations, including their resourcing and willingness to invest time in circumvention attempts.
2.2.1 Risk Assessment7.4 Documenting Mitigation Assessments
Some developers have begun to explore structured documentation approaches for their mitigation assessments. This approach is inspired by safety case approaches in other industries and may involve documenting key claims about system safety, supporting evidence, critical assumptions, and linking specific evidence from evaluations to safety claims. For example, Anthropic published a report explaining the design of the mitigation measures applied to the Claude 4 model, the rationale for the residual risk of the deployment being sufficiently low, and a set of ongoing operating requirements for the mitigations to maintain their effectiveness. This approach could also be used for autonomy-related risks. Like the documentation of frontier capability assessments, this method can facilitate internal decision-making and more transparent communication with external stakeholders about safety considerations.
2.2.4 Assurance DocumentationCapability Limitation Mitigations
Capability limitation mitigations aim to prevent models from possessing knowledge or abilities that could enable harm. These methods alter the model’s weights or training process, so that it cannot assist with harmful actions when prompted by humans or autonomously pursue harmful objectives.
1.1.3 Capability ModificationCapability Limitation Mitigations > Data Filtering
Data filtering involves removing content from training datasets that could lead to dual-use or potentially harmful capabilities. Developers can use several methods: automated classifiers to identify and remove content related to weapons development, detailed attack methodologies, or other high-risk domains; keyword-based filters to exclude documents containing specific terminology or instructions of concern; and machine learning models trained to recognize subtle patterns in content that might contribute to dangerous capabilities.
1.1.1 Training DataCapability Limitation Mitigations > Exploratory Methods
Beyond data filtering, researchers are investigating additional capability limitation approaches
1.1.3 Capability ModificationCapability Limitation Mitigations
Capability limitation mitigations aim to prevent models from possessing knowledge or abilities that could enable harm. These methods alter the model's weights or training process, so that it cannot assist with harmful actions when prompted by humans or autonomously pursue harmful objectives. However, the effectiveness of these mitigations is an active area of research, and they can currently be circumvented if dual-use knowledge (knowledge that has both benign and harmful applications) is added in the context window during inference or fine-tuning.
1.1.3 Capability ModificationCapability Limitation Mitigations > 2.1 Data Filtering
Data filtering involves removing content from training datasets that could lead to dual-use or potentially harmful capabilities. Developers can use several methods: automated classifiers to identify and remove content related to weapons development, detailed attack methodologies, or other high-risk domains; keyword-based filters to exclude documents containing specific terminology or instructions of concern; and machine learning models trained to recognize subtle patterns in content that might contribute to dangerous capabilities.
1.1.1 Training DataCapability Limitation Mitigations > 2.2 Exploratory Methods
Beyond data filtering, researchers are investigating additional capability limitation approaches. However, these methods face technical challenges, and their effectiveness remains uncertain. ● Model distillation could create specialized versions of frontier models with capabilities limited to specific domains. For example, a model could excel at medical diagnosis while lacking knowledge needed for biological weapons development. While the capability limitations may be more fundamental than post-hoc safety training, it remains unclear how effectively this approach prevents harmful capabilities from being reconstructed. Additionally, multiple specialized models would be needed to cover various use cases, increasing development and maintenance costs. ● Targeted unlearning attempts to remove specific dangerous capabilities from models after initial training, offering a more precise alternative to full retraining. Possible approaches include fine-tuning on datasets to overwrite specific knowledge while preserving general capabilities, or modifying how models internally structure and access particular information. However, these methods may be reversible with relatively modest effort – restoring "unlearned" capabilities through targeted fine-tuning with small datasets. Models may also regenerate removed knowledge by inferring from adjacent information that remains accessible. While research continues on these approaches, developers currently rely more heavily on post-deployment mitigations that can be more reliably implemented and assessed.
1.1.3 Capability ModificationFrontier Mitigations
Frontier Model Forum (2025)
Frontier mitigations are protective measures implemented on frontier models, with the goal of reducing the risk of potential high-severity harms, especially those related to national security and public safety, that could arise from their advanced capabilities. This report discusses emerging industry practices for implementing and assessing frontier mitigations. It focuses on mitigations for managing risks in three primary domains: chemical, biological, radiological and nuclear (CBRN) information threats; advanced cyber threats; and advanced autonomous behavior threats. Given the nascent state of frontier mitigations, this report describes the range of controls and mitigation strategies being employed or researched by Frontier Model Forum members and documents the known limitations of these approaches.
Verify and Validate
Testing, evaluating, auditing, and red-teaming the AI system
Developer
Entity that creates, trains, or modifies the AI system
Measure
Quantifying, testing, and monitoring identified AI risks
Other