Technical mechanisms and engineering interventions that directly modify how an AI system processes inputs, generates outputs, or operates, including changes to models, training procedures, runtime behaviors, and supporting hardware.
Intervention techniques complement monitoring tools by offering various strategies to act on systems in ways that reduce risks from harmful behaviours.
Reasoning
Description too abstract—"various strategies to act on systems" lacks specificity to identify mechanism or implementation level.
Hardware-enabled mechanisms
Tools built into hardware could be used to enforce requirements about what can be run and by whom on specialised hardware (RAND). For example, hardware mechanisms could be used to block or halt certain jobs from being run on hardware if they fail an authentication process. However, the main barrier to their use continues to be the engineering challenge of implementing and integrating these mechanisms. If implemented successfully, hardware-enabled mechanisms could play a unique role in verifying compliance, even for international agreements and across borders (Brundage, IAISR).
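The idea of blocking jobs that fail an authentication process can be illustrated in software. The following is a minimal sketch only, assuming a shared-secret scheme based on HMAC; the names `DEVICE_KEY`, `sign_job`, and `admit_job` are hypothetical, and a real hardware-enabled mechanism would hold keys in tamper-resistant hardware and would likely use a different protocol.

```python
import hashlib
import hmac

# Hypothetical secret provisioned into the device. This is an assumption for
# illustration; real designs would not keep the key in program memory.
DEVICE_KEY = b"example-device-key"


def sign_job(job_manifest: bytes, key: bytes) -> str:
    """Authority side: produce an authorisation tag for a job manifest."""
    return hmac.new(key, job_manifest, hashlib.sha256).hexdigest()


def admit_job(job_manifest: bytes, tag: str, key: bytes = DEVICE_KEY) -> bool:
    """Device side: admit the job only if its tag verifies against the key."""
    expected = hmac.new(key, job_manifest, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, tag)


manifest = b"job=training-run;accelerators=8"
tag = sign_job(manifest, DEVICE_KEY)
print(admit_job(manifest, tag))                # authorised manifest
print(admit_job(manifest + b";extra=1", tag))  # manifest altered after signing
```

In this sketch any change to the manifest after signing causes the check to fail, which is the property an enforcement mechanism of this kind would rely on.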
1.2.4 Security Infrastructure
Off-switches
An “off switch” refers to a mechanism that allows for the effective shutdown of a system. Shutdowns can be challenging for multiple reasons, including the distributed nature of systems and the need to pass critical tasks (e.g. driving) on to specialised risk-mitigating systems after a shutdown. The key challenges lie in implementing and integrating reliable mechanisms. Additional challenges could be posed by systems that actively take actions to prevent humans from shutting them down. These will be discussed in 3.1.3.
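The handover of critical tasks at shutdown can be sketched as follows. This is an illustrative toy, not a description of any deployed mechanism: the `OffSwitch` class, its methods, and the task names are all hypothetical, and the sketch assumes a single process rather than the distributed setting the text identifies as a challenge.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class OffSwitch:
    """Hypothetical shutdown coordinator; all names here are illustrative."""

    fallbacks: Dict[str, Callable[[], None]] = field(default_factory=dict)
    running: bool = True

    def register_fallback(self, task: str, handler: Callable[[], None]) -> None:
        """Associate a critical task with a risk-mitigating fallback handler."""
        self.fallbacks[task] = handler

    def shutdown(self, active_tasks: List[str]) -> List[str]:
        """Stop the main system, hand over tasks, return tasks left uncovered."""
        self.running = False  # stop first, so shutdown never waits on a handover
        uncovered = []
        for task in active_tasks:
            handler = self.fallbacks.get(task)
            if handler is None:
                uncovered.append(task)
            else:
                handler()
        return uncovered


handed_over: List[str] = []
switch = OffSwitch()
switch.register_fallback("driving", lambda: handed_over.append("driving"))
uncovered = switch.shutdown(["driving", "route-planning"])
print(switch.running, handed_over, uncovered)
```

The sketch stops the system unconditionally and reports any task without a fallback, rather than refusing to shut down. Whether shutdown should ever be delayed for a missing fallback is a design choice the source does not settle.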
1.2.2 Runtime Environment
Override protocols
Intervention procedures that replace harmful systems or system outputs with safe ones offer a final failsafe against harmful behaviour. For example, systems making high-stakes decisions may require a human in the loop to make key choices, and either the system or a human may trigger the request for human intervention. The principal barrier to the effective use of overrides lies in designing systems that balance efficiency with safety. To this end, it will also be useful to define metrics that quantify the degree to which a system is under the meaningful control of human operators.
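A human-in-the-loop override of this kind might look like the sketch below. It is a minimal illustration under stated assumptions: the function `decide`, the `stakes` and `confidence` inputs, and both threshold values are hypothetical, and how such quantities would be estimated in practice is left open.

```python
from typing import Callable, Optional


def decide(
    proposal: str,
    stakes: float,
    confidence: float,
    ask_human: Callable[[str], Optional[str]],
    stakes_threshold: float = 0.7,      # assumed value, for illustration only
    confidence_threshold: float = 0.9,  # assumed value, for illustration only
    human_requested: bool = False,
) -> str:
    """Return the system's proposal unless a human reviews and replaces it."""
    # Review can be triggered by the system (high stakes or low confidence)
    # or directly by a human operator.
    needs_review = (
        human_requested
        or stakes >= stakes_threshold
        or confidence < confidence_threshold
    )
    if not needs_review:
        return proposal
    override = ask_human(proposal)
    # None means the reviewer accepted the proposal unchanged.
    return proposal if override is None else override


# A reviewer who always rejects, used here only to show the control flow.
always_reject = lambda proposal: "reject"
print(decide("approve", stakes=0.1, confidence=0.99, ask_human=always_reject))
print(decide("approve", stakes=0.9, confidence=0.99, ask_human=always_reject))
```

Lowering the thresholds routes more decisions to a human, which illustrates the trade-off between efficiency and safety described above.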
1.2.9 Other
Incident and emergency preparedness and response
Research is needed on protocols for rapid response and reporting for incidents without introducing new vulnerabilities (Wasil-A, Wasil-B). Technical and organisational questions remain about isolating compromised components while maintaining critical functions. A significant challenge involves verifying emergency response mechanisms against human error and potential exploitation by increasingly capable AI systems.
2.3.4 Incident Response
Risk Assessment
The primary goal of risk assessment is to understand the severity and likelihood of a potential harm. Risk assessments are used to prioritise risks and determine if they cross thresholds that demand specific action. Consequential development and deployment decisions are predicated on these assessments. The research areas in this category involve:
A. Developing methods to measure the impact of AI systems for both current and future AI – This includes developing standardised assessments for risky behaviours of AI systems through audit techniques and benchmarks, evaluation and assessment of new capabilities, including potentially dangerous ones; and for real-world societal impact such as labour, misinformation and privacy through field tests and prospective risks analysis.
B. Enhancing metrology to ensure that the measurements are precise and repeatable – This includes research in technical methods for quantitative risk assessment tailored to AI systems to reduce uncertainty and the need for large safety margins. This is an important open area of research.
C. Building enablers for third-party audits to support independent validation of risk assessments – This includes developing secure infrastructure that enables thorough evaluation while protecting intellectual property, including preventing model theft.
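The opening idea of this section, that risks are characterised by severity and likelihood and then compared against action thresholds, can be sketched as follows. This is one common convention (score as severity multiplied by likelihood), not a method prescribed by the source; the `Risk` class, the scales, the example risks, and the threshold value are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Risk:
    name: str
    severity: float    # assumed scale: 0 (negligible) to 10 (catastrophic)
    likelihood: float  # assumed probability in [0, 1]

    @property
    def score(self) -> float:
        # Simple expected-severity score; other aggregation rules are possible.
        return self.severity * self.likelihood


def prioritise(risks: List[Risk], action_threshold: float) -> List[Tuple[str, float, bool]]:
    """Rank risks by score and flag those that cross the action threshold."""
    ranked = sorted(risks, key=lambda r: r.score, reverse=True)
    return [(r.name, r.score, r.score >= action_threshold) for r in ranked]


# Hypothetical example risks with made-up numbers.
risks = [
    Risk("frequent minor error", severity=2.0, likelihood=0.9),
    Risk("rare severe failure", severity=8.0, likelihood=0.5),
]
for name, score, needs_action in prioritise(risks, action_threshold=3.0):
    print(f"{name}: score={score:.1f}, action required={needs_action}")
```

A single multiplied score hides uncertainty in both inputs, which is one reason the text calls for better quantitative methods and metrology.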
2.2.1 Risk Assessment
Risk Assessment > Audit techniques and benchmarks
Techniques and benchmarks with which AI systems can be effectively and efficiently tested for harmful behaviours are highly varied and central to risk assessments (IAISR, Birhane-A).
3.2.1 Benchmarks & Evaluation
Risk Assessment > Downstream impact assessment and forecasting
Assessing and forecasting the many societal impacts of AI systems is one of the most central goals of risk assessments.
2.2.1 Risk Assessment
Risk Assessment > Secure evaluation infrastructure
External auditors and oversight bodies need infrastructure and protocols that enable thorough evaluation while protecting sensitive intellectual property. Ideally, evaluation infrastructure should enable double-blindness: evaluators' inability to directly access the system’s parameters and developers’ inability to know exactly which evaluations are run (Reuel, Bucknall-A, Casper-B). Meanwhile, the importance of mutual security will continue to grow as system capabilities and risks increase. Methods for developing secure infrastructure for auditing and oversight are known to be possible.
3.2.2 Technical Standards
Risk Assessment > System safety assessment
Safety assessment is not just about individual AI systems, but also their interaction with the rest of the world. For example, when an AI company discovers concerning behaviour from their system, the resulting risks depend, in part, on having internal processes in place to escalate the issue to senior leadership and work to mitigate the risks. System safety considers both AI systems and the broader context that they are deployed in. The study of system safety focuses on the interactions between different technical components as well as processes and incentives in an organisation (IAISR, Hendrycks-B, AISES, Alaga).
2.2.1 Risk Assessment
Risk Assessment > Metrology for AI risk assessment
Metrology, the science of measurement, has only recently been studied in the context of AI risk assessment (IAISR, Hobbhahn). Current approaches generally lack standardisation, repeatability, and precision.
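One basic way to examine repeatability and precision is to rerun the same evaluation several times and report the spread alongside the mean. The sketch below assumes a list of scores from repeated runs of a single benchmark; the function name `summarise_runs` and the example scores are hypothetical, and the sketch covers only run-to-run variation, not other sources of measurement error.

```python
import math
import statistics
from typing import Dict, List


def summarise_runs(scores: List[float]) -> Dict[str, float]:
    """Summarise repeated evaluation runs (requires at least two scores)."""
    n = len(scores)
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation
    sem = stdev / math.sqrt(n)        # standard error of the mean
    return {"n": n, "mean": mean, "stdev": stdev, "sem": sem}


# Made-up scores from five hypothetical reruns of one benchmark.
summary = summarise_runs([0.71, 0.69, 0.74, 0.70, 0.72])
print(summary)
```

A large standard deviation relative to the differences being compared suggests that a single run is not a reliable measurement.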
3.2.1 Benchmarks & Evaluation
The Singapore Consensus on Global AI Safety Research Priorities
Bengio, Yoshua; Maharaj, Tegan; Ong, C.-H. Luke; Russell, Stuart D.; Song, Dawn; Tegmark, Max; Lan, Xue; Zhang, Ya-Qin; Casper, Stephen; Lee, Wan Sie; Mindermann, Sören; Wilfred, Vanessa; Balachandran, Vidhisha; Barez, Fazl; Belinsky, Michael; Bello, Imane; Bourgon, Malo; Brakel, Mark; Campos, Siméon; Cass-Beggs, Duncan; Chen, Jiahao; Chowdhury, Rumman; Seah, Kuan Chua; Clune, Jeff; Dai, Jie; Delaborde, Agnes; Dziri, Nouha; Eiras, Francisco; Engels, Joshua; Fan, Jinyu; Gleave, Adam; Goodman, Noah D.; Heide, Fynn; Heidecke, Johannes; Hendrycks, Dan; Hodes, Cyrus; Hsiang, Bryan Low Kian; Huang, Minlie; Jawhar, Sami; Wang, Jingyu; Kalai, Adam Tauman; Kamphuis, Meindert; Kankanhalli, Mohan; Kantamneni, Subhash; Kirk, M.; Kwa, Thomas; Ladish, Jeffrey; Lam, Kwok-Yan; Lee, Wan Sie; Lee, Taewhi; Li, Xiaopeng; Liu, Jiajun; Lu, Ching-Cheng; Mai, Yifan; Mallah, Richard; Michael, Julian; Moës, Nick; Møller, Simon Geir; Nam, K. H.; Ng, TP; Nitzberg, Mark; Nushi, Besmira; Ó hÉigeartaigh, Seán; Ortega, Alejandro; Peigné, Pierre; Petrie, J. Howard; Prud'homme, Benjamin; Rabbany, Reihaneh; Sanchez-Pi, Nayat; Schwettmann, Sarah; Shlegeris, Buck; Siddiqui, Saad; Sinha, Ashish; Soto, Martín; Tan, Cheston; Dong, Ting; Tjhi, William; Trager, Robert; Tse, Brian; Tung, Anthony K. H.; Willes, John; Wong, David; Xu, Wei; Xu, Rong; Zeng, Yi; Zhang, Hao; Žikelić, Djordje (2025)
This is the first International AI Safety Report. Following an interim publication in May 2024, a diverse group of 96 Artificial Intelligence (AI) experts contributed to this first full report, including an international Expert Advisory Panel nominated by 30 countries, the Organisation for Economic Co-operation and Development (OECD), the European Union (EU), and the United Nations (UN). The report aims to provide scientific information that will support informed policymaking. It does not recommend specific policies…. This report summarises the scientific evidence on the safety of general-purpose AI. The purpose of this report is to help create a shared international understanding of risks from advanced AI and how they can be mitigated. To achieve this, this report focuses on general-purpose AI – or AI that can perform a wide variety of tasks – since this type of AI has advanced particularly rapidly in recent years and has been deployed widely by technology companies for a range of consumer and business purposes. The report synthesises the state of scientific understanding of general-purpose AI, with a focus on understanding and managing its risks. Amid rapid advancements, research on general-purpose AI is currently in a time of scientific discovery, and – in many cases – is not yet settled science. The report provides a snapshot of the current scientific understanding of general-purpose AI and its risks. This includes identifying areas of scientific consensus and areas where there are different views or gaps in the current scientific understanding. People around the world will only be able to fully enjoy the potential benefits of general-purpose AI safely if its risks are appropriately managed. This report focuses on identifying those risks and evaluating technical methods for assessing and mitigating them, including ways that general-purpose AI itself can be used to mitigate risks.
Operate and Monitor
Running, maintaining, and monitoring the AI system post-deployment
Deployer
Entity that integrates and deploys the AI system for end users
Manage
Prioritising, responding to, and mitigating AI risks
Primary
7 AI System Safety, Failures & Limitations