Red teaming, capability evaluations, adversarial testing, and performance verification.
Also in Risk & Assurance
Stage: Detection
Stakeholder: Third Party Researchers
Additional information: Techniques that monitor AI model internals in addition to outputs have shown promise in detecting deception (Goldowsky-Dill et al. 2025). Developers and researchers should continue improving adversarial techniques, sharing results, and developing standardised benchmarks to assess autonomy and other capabilities (Barnett & Thiergart 2024; Greenblatt, Shlegeris et al. 2024).
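The internals-monitoring approach mentioned above is often implemented as a linear probe: a simple classifier trained on a model's hidden activations rather than its outputs. The sketch below is illustrative only — the activations and labels are synthetic stand-ins (the dimensionality, separation direction, and learning rate are all assumptions), whereas in practice they would come from instrumented forward passes of a real model.

```python
import numpy as np

# Hypothetical sketch of a linear probe on hidden activations.
# Synthetic data stands in for real "honest" vs "deceptive" activations.
rng = np.random.default_rng(0)
d = 16  # assumed hidden-state dimensionality

# Simulate two activation clusters separated along one direction.
direction = rng.normal(size=d)
honest = rng.normal(size=(200, d))
deceptive = rng.normal(size=(200, d)) + 1.5 * direction

X = np.vstack([honest, deceptive])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Train a logistic-regression probe with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(deceptive)
    w -= 0.5 * (X.T @ (p - y) / len(y))     # gradient step on weights
    b -= 0.5 * np.mean(p - y)               # gradient step on bias

# Evaluate the probe on the training data (a real setup would hold out data).
preds = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
accuracy = float(np.mean(preds == y))
```

On this synthetic data the probe separates the clusters easily; the point is only that the detector reads internal states, not model outputs.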
Reasoning
Red teaming and adversarial testing evaluate system capabilities and identify vulnerabilities through structured assessment.
- Monitor critical capability levels (2.2.2 Testing & Evaluation)
- Identify early warning signs and emergent capabilities (2.2.1 Risk Assessment)
- Establish standardised benchmarks and reporting (3.2.1 Benchmarks & Evaluation)
- Implement compute monitoring and anomaly detection (1.2.3 Monitoring & Detection)
- Enhance hardware and supply chain oversight (2.3.3 Monitoring & Logging)
- Lead efforts to establish shared criteria for AI LOC (3.2.2 Technical Standards)

Strengthening Emergency Preparedness and Response for AI Loss of Control Incidents
Somani, Elika; Friedman, Anjay; Wu, Henry; Lu, Marianne; Byrd, Christopher; van Soest, Henri; Zakaria, Sana (2025)
As artificial intelligence (AI) systems become increasingly embedded in essential infrastructure and services, the risks associated with unintended failures rise. Developing comprehensive emergency response protocols could help mitigate these significant risks. This report focuses on understanding and addressing AI loss of control (LOC) scenarios where human oversight fails to adequately constrain an autonomous, general-purpose AI.
Verify and Validate: Testing, evaluating, auditing, and red-teaming the AI system
Other: Actor type not captured by the standard categories
Measure: Quantifying, testing, and monitoring identified AI risks