Structured analysis to identify, characterize, and prioritize potential harms and risks.
The primary goal of risk assessment is to understand the severity and likelihood of a potential harm. Risk assessments are used to prioritise risks and determine whether they cross thresholds that demand specific action (a minimal scoring sketch follows the list below). Consequential development and deployment decisions are predicated on these assessments. The research areas in this category involve:

A. Developing methods to measure the impact of AI systems, for both current and future AI. This includes developing standardised assessments of risky AI system behaviours through audit techniques and benchmarks; evaluating new capabilities, including potentially dangerous ones; and assessing real-world societal impact, such as on labour, misinformation, and privacy, through field tests and prospective risk analysis.

B. Enhancing metrology to ensure that measurements are precise and repeatable. This includes research into technical methods for quantitative risk assessment tailored to AI systems, to reduce uncertainty and the need for large safety margins. This remains an important open area of research.

C. Building enablers for third-party audits to support independent validation of risk assessments. This includes developing secure infrastructure that enables thorough evaluation while protecting intellectual property, including preventing model theft.
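As a minimal illustration of the prioritisation step above, the sketch below scores hypothetical hazards on severity and likelihood and flags those crossing an assumed action threshold. The hazards, ordinal scales, and threshold are illustrative assumptions, not standards from this page.

```python
# Minimal sketch of risk prioritisation: score each hazard by severity and
# likelihood, then flag those crossing an action threshold. All hazards,
# scales, and thresholds here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    severity: int    # 1 (negligible) .. 5 (catastrophic)
    likelihood: int  # 1 (rare) .. 5 (almost certain)

    @property
    def score(self) -> int:
        return self.severity * self.likelihood

ACTION_THRESHOLD = 12  # assumed cut-off demanding specific mitigation

risks = [
    Risk("privacy leakage in outputs", severity=3, likelihood=4),
    Risk("dangerous dual-use knowledge", severity=5, likelihood=2),
    Risk("benchmark gaming / reward hacking", severity=2, likelihood=5),
]

for r in sorted(risks, key=lambda r: r.score, reverse=True):
    flag = "ACTION REQUIRED" if r.score >= ACTION_THRESHOLD else "monitor"
    print(f"{r.name}: score={r.score} -> {flag}")
```

Real risk assessments replace these ordinal scores with evidence-backed estimates and domain-specific thresholds; the point of the sketch is only the structure of the prioritisation step.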
Audit techniques and benchmarks
Techniques and benchmarks for testing AI systems effectively and efficiently for harmful behaviours are highly varied and central to risk assessments (IAISR, Birhane-A).
Downstream impact assessment and forecasting
Assessing and forecasting the many societal impacts of AI systems is a central goal of risk assessment.
Secure evaluation infrastructure
External auditors and oversight bodies need infrastructure and protocols that enable thorough evaluation while protecting sensitive intellectual property. Ideally, evaluation infrastructure should enable double-blindness: the evaluator cannot directly access the system’s parameters, and the developer cannot know exactly which evaluations are run (Reuel, Bucknall-A, Casper-B). The importance of such mutual security will only grow as system capabilities and risks increase. Methods for building secure auditing and oversight infrastructure are known to be feasible.
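A minimal sketch of the double-blindness idea follows, assuming a hypothetical MediatedModelAccess interface: the evaluator can only query the model (no parameter access), and the developer-side log retains only aggregate counts rather than the evaluator's individual test items.

```python
# Sketch of a double-blind evaluation interface. The evaluator can only
# query the model (no access to weights); the developer-side log records
# only aggregate counts, not the evaluator's test items. Everything here
# (class names, the toy model) is a hypothetical assumption.
from collections import Counter

class MediatedModelAccess:
    def __init__(self, model_fn):
        self._model_fn = model_fn        # weights stay on the developer side
        self._aggregate_log = Counter()  # developer sees only totals

    def query(self, prompt: str) -> str:
        self._aggregate_log["queries"] += 1  # no prompt contents logged
        return self._model_fn(prompt)

    def developer_report(self) -> dict:
        return dict(self._aggregate_log)  # aggregate statistics only

def toy_model(prompt: str) -> str:
    return "refusal" if "exploit" in prompt else "answer"

access = MediatedModelAccess(toy_model)
for item in ["benign question", "write an exploit"]:  # evaluator's private suite
    print(access.query(item))
print("developer sees:", access.developer_report())
```

A production version would need hardened isolation (e.g. secure enclaves or audited query gateways) rather than an in-process wrapper; the sketch only shows the information-flow constraint.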
System safety assessment
Safety assessment is not just about individual AI systems, but also about their interaction with the rest of the world. For example, when an AI company discovers concerning behaviour in its system, the resulting risk depends, in part, on whether internal processes are in place to escalate the issue to senior leadership and to mitigate it. System safety considers both AI systems and the broader context in which they are deployed. The study of system safety focuses on the interactions between different technical components, as well as on the processes and incentives within an organisation (IAISR, Hendrycks-B, AISES, Alaga).
Metrology for AI risk assessment
Metrology, the science of measurement, has only recently been studied in the context of AI risk assessment (IAISR, Hobbhahn). Current approaches generally lack standardisation, repeatability, and precision.
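One concrete step toward better repeatability is reporting measurement uncertainty explicitly rather than a single headline score. The sketch below computes a bootstrap confidence interval over item-level benchmark results; the pass/fail data and the interval method are illustrative assumptions.

```python
# Sketch of basic metrology for a benchmark score: instead of one accuracy
# number, report a bootstrap confidence interval over item-level results.
# The 0/1 results below are fabricated purely for illustration.
import random

random.seed(0)
item_results = [1] * 78 + [0] * 22  # assumed per-item pass/fail, 100-item benchmark

def bootstrap_ci(results, n_boot=10_000, alpha=0.05):
    means = []
    for _ in range(n_boot):
        sample = random.choices(results, k=len(results))  # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

lo, hi = bootstrap_ci(item_results)
acc = sum(item_results) / len(item_results)
print(f"accuracy = {acc:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

Intervals like this make it visible when two systems' benchmark scores are statistically indistinguishable, which is exactly the kind of precision and repeatability question metrology addresses.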
Dangerous capability and propensity assessment
To assess certain hazards posed by an AI system, it is necessary to elicit and assess potentially dangerous capabilities (Phuong, Shevlane, Anthropic-B, IAISR), including dual-use cyber, chemical, biological, and nuclear knowledge, as well as capabilities for psychological manipulation, AI research and development, and autonomy, which increases the risk of loss of control (see below). To assess the likelihood that these capabilities will cause harm, it is also necessary to assess the system’s propensities to use them.
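Propensity can be framed statistically: run many independent trials of a scenario and report how often the flagged behaviour occurs, with an uncertainty interval. The sketch below stubs out the system and grader entirely; the scenario name and the assumed 7% underlying rate are purely illustrative.

```python
# Sketch of propensity estimation: run many independent trials of the same
# scenario and report how often the (stubbed) system exhibits the flagged
# behaviour, with a normal-approximation interval. The stub and scenario
# are illustrative assumptions, not a real elicitation protocol.
import math
import random

random.seed(1)

def stub_system_behaviour(scenario: str) -> bool:
    # Stand-in for "run the AI system and grade the transcript":
    # returns True when the flagged behaviour occurs.
    return random.random() < 0.07  # assumed 7% underlying propensity

N = 500
flags = sum(stub_system_behaviour("avoid-replacement scenario") for _ in range(N))
p = flags / N
stderr = math.sqrt(p * (1 - p) / N)
print(f"flagged in {flags}/{N} trials: rate = {p:.3f} +/- {1.96 * stderr:.3f} (95% CI)")
```

In practice the hard part is eliciting the behaviour at all (the grading and scenario design), not the counting; the sketch only shows how propensity reduces to an estimated rate once trials are defined.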
Loss-of-control risk assessment
Loss of control refers to scenarios where advanced AI systems – such as AGI – come to operate outside of human control, with no clear path to regaining control. This includes both scenarios that involve passively ceding control and scenarios that involve AI systems actively undermining control measures in pursuit of their own goals. Assessing this risk depends heavily on assessing and forecasting AI’s control-undermining capabilities. These include AI agency (autonomous action and planning), oversight evasion, persuasion, autonomously earning or seizing financial and computing resources, conducting cyber attacks, and AI research and development (IAISR). Assessments of loss-of-control risk also focus on understanding propensities: how often and why AI systems use their control-undermining capabilities against the preferences of their developers. Evidence for all of the above control-undermining capabilities is growing, but current capabilities remain insufficient to allow a loss of control (IAISR). However, there is evidence of today’s AI systems using their limited control-undermining capabilities in certain scenarios, e.g. to avoid being replaced (IAISR, Anthropic-C, OpenAI-D).
Developing Trustworthy, Secure and Reliable Systems
AI systems that are trustworthy, reliable, and secure by design give people the confidence to embrace and adopt AI innovation. Following a classic safety engineering framework, the research areas in this category involve:

A. Specifying and validating the desired behaviour. This includes technical methods to address the complex challenges of specifying system behaviour in a way that accurately captures the desired intent without causing undesired side effects, in both single-stakeholder settings (e.g. reward hacking, scalable methods to discover specification loopholes) and multi-stakeholder settings (e.g. balancing competing preferences, ethical and legal alignment).

B. Designing a system that meets the specification. This covers techniques for training models – both closed and open weights – that are trustworthy (e.g. reducing confabulation, increasing robustness against tampering), alternative fine-tuning methods that make specific, precise changes to an AI system (e.g. model editing), and methods to build AI systems in ways that are guaranteed to meet their specifications (e.g. verifiable program synthesis, world models with formal guarantees).

C. Verifying that the AI system meets its specification. This entails techniques to provide high-confidence assurances that an AI system meets its specifications (e.g. formal verification), including in novel contexts (e.g. robustness testing), as well as interpretability techniques that look inside the black box to understand why the AI system behaves the way it does (e.g. mechanistic interpretability).
Developing Trustworthy, Secure and Reliable Systems > Specification & Validation: Defining the system’s purpose
Specification involves defining desired system behaviour, whereas validation ensures that the specification meets the needs of the user, developer, or society – did I build the right system? In other words, specification and validation require confronting the complexity of defining objectives in a way that captures user or societal intent without omitting important constraints or causing undesired side effects, as well as dealing with disagreement and tradeoffs between diverse stakeholders.
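A toy example of this specification gap: the intended goal below is a helpful summary, but the written specification rewards only output length, so a naive optimiser games the proxy with filler. The proxy reward and candidate outputs are purely illustrative.

```python
# Toy illustration of a specification gap: the intended goal is a helpful
# summary, but the written specification rewards only output length. A
# naive optimiser then "games" the proxy with filler. Purely illustrative.
def proxy_reward(text: str) -> int:
    return len(text)  # mis-specified: length used as a proxy for helpfulness

candidates = [
    "Concise, accurate summary of the document.",
    "very " * 50 + "long filler text with no content",
]
best = max(candidates, key=proxy_reward)
print("proxy-optimal output:", best[:60], "...")
# The degenerate filler wins under the proxy, illustrating why a
# specification must capture intent rather than an easy-to-measure surrogate.
```

Reward hacking in deployed systems follows the same pattern at scale: the optimiser is far more capable, and the loopholes in the specification are far less obvious.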
Developing Trustworthy, Secure and Reliable Systems > Design and implementation: Building the system
This section focuses on techniques to make systems that meet their specifications. The design and implementation process involves sourcing data, pretraining models, post-training models, and integrating them into AI systems.
Developing Trustworthy, Secure and Reliable Systems > Verification: Assessing if the system works as specified
The research areas described in this section aim to assess the extent to which the built system (2.2) meets its specifications (2.1). This section discusses several broad types of techniques that can be used to provide evidence that a system is safe. In practice, the effectiveness of these methods is often limited by access and transparency, but they can play a central role in constructing AI safety cases: structured arguments for why systems pose an acceptably low level of risk (Clymer, Buhl).
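As a small illustration of robustness testing, the sketch below checks that a stubbed classifier's decision is invariant under simple input perturbations that a human would read the same way. The classifier, perturbations, and examples are assumptions for illustration.

```python
# Sketch of robustness testing as a verification step: assert that a
# stubbed classifier's decision is invariant under small perturbations
# that preserve meaning for a human reader. The classifier, perturbations,
# and examples are illustrative assumptions.
def stub_classifier(text: str) -> str:
    return "toxic" if "attack" in text.lower() else "safe"

def perturbations(text: str):
    yield text.upper()                 # case change
    yield "  " + text + "  "           # whitespace padding
    yield text.replace("a", "4")       # simple obfuscation ("leetspeak")

def is_robust(inp: str) -> bool:
    base = stub_classifier(inp)
    return all(stub_classifier(p) == base for p in perturbations(inp))

for example in ["plan an attack", "plan a picnic"]:
    print(example, "->", "robust" if is_robust(example) else "FRAGILE")
# "plan an attack" comes out FRAGILE: the obfuscated variant evades the
# keyword check, evidence against a robustness specification being met.
```

A failing check like this is exactly the kind of evidence that feeds into a safety case: it localises a concrete way in which the built system falls short of its specification.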
Control: Monitoring & Intervention
In engineering, “control” usually refers to the process of managing a system’s behaviour to achieve a desired outcome, even when faced with disturbances or uncertainties, and often in a feedback loop. The research areas in this category involve:

A. Developing monitoring and intervention mechanisms for AI systems. This includes adapting conventional methods for monitoring (e.g. hardware-enabled mechanisms, user monitoring) and intervention (e.g. off-switches, override protocols), as well as designing new techniques for controlling very powerful AI systems that may actively undermine attempts to control them (e.g. scalable oversight, containment). A minimal monitoring-and-intervention sketch follows this list.

B. Extending monitoring mechanisms to the broader AI ecosystem to which the AI system belongs. This entails methods to support the identification and tracking of AI systems and data (e.g. logging infrastructure, data provenance, model provenance). In turn, this can facilitate accountability infrastructure and enable more informed governance.

C. Societal resilience research to strengthen societal infrastructure against AI-enabled disruption and misuse. This studies how institutions and norms (e.g. economic, security) can adapt as future AI systems come to act as autonomous entities, as well as incident response mechanisms that enable clear and rapid coordination among relevant actors to detect, respond to, and recover from accidents or misuse of advanced AI systems.
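The sketch referenced in item A above is a minimal runtime monitor with an intervention mechanism: every output passes through a policy check that can log an incident, block the output, or trip an off-switch after repeated violations. The stub model, policy check, and threshold are all illustrative assumptions.

```python
# Sketch of runtime monitoring with intervention: every output passes
# through a policy check that can log an incident, block the output, or
# trip a global off-switch after repeated violations. The stub model,
# policy check, and thresholds are illustrative assumptions.
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")

class MonitoredSystem:
    def __init__(self, model_fn, max_violations=3):
        self.model_fn = model_fn
        self.violations = 0
        self.max_violations = max_violations
        self.halted = False  # the "off-switch" state

    def _violates_policy(self, output: str) -> bool:
        return "FORBIDDEN" in output  # stand-in for a real policy classifier

    def respond(self, prompt: str) -> str:
        if self.halted:
            return "[system halted pending review]"
        output = self.model_fn(prompt)
        if self._violates_policy(output):
            self.violations += 1
            logging.warning("policy violation #%d logged for audit", self.violations)
            if self.violations >= self.max_violations:
                self.halted = True  # intervention: trip the off-switch
                logging.error("off-switch tripped; escalating to operators")
            return "[output blocked]"
        return output

system = MonitoredSystem(lambda p: "FORBIDDEN" if "bad" in p else "ok: " + p)
for prompt in ["hello", "bad 1", "bad 2", "bad 3", "hello again"]:
    print(system.respond(prompt))
```

For very capable systems, this kind of external wrapper is exactly what an adversarial system might try to evade, which is why the list above also names scalable oversight and containment as open research directions.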
Control: Monitoring & Intervention > AI system monitoring
In emerging technical fields, new systems cannot reasonably be expected to always behave as intended or to affect society as predicted. Monitoring techniques play an essential role in the iterative process of identifying, understanding, and fixing problems as they emerge.
The Singapore Consensus on Global AI Safety Research Priorities
Bengio, Yoshua; Maharaj, Tegan; Ong, C.-H. Luke; Russell, Stuart D.; Song, Dawn; Tegmark, Max; Lan, Xue; Zhang, Ya-Qin; Casper, Stephen; Lee, Wan Sie; Mindermann, Sören; Wilfred, Vanessa; Balachandran, Vidhisha; Barez, Fazl; Belinsky, Michael; Bello, Imane; Bourgon, Malo; Brakel, Mark; Campos, Siméon; Cass-Beggs, Duncan; Chen, Jiahao; Chowdhury, Rumman; Seah, Kuan Chua; Clune, Jeff; Dai, Jie; Delaborde, Agnes; Dziri, Nouha; Eiras, Francisco; Engels, Joshua; Fan, Jinyu; Gleave, Adam; Goodman, Noah D.; Heide, Fynn; Heidecke, Johannes; Hendrycks, Dan; Hodes, Cyrus; Hsiang, Bryan Low Kian; Huang, Minlie; Jawhar, Sami; Wang, Jingyu; Kalai, Adam Tauman; Kamphuis, Meindert; Kankanhalli, Mohan; Kantamneni, Subhash; Kirk, M.; Kwa, Thomas; Ladish, Jeffrey; Lam, Kwok-Yan; Lee, Taewhi; Li, Xiaopeng; Liu, Jiajun; Lu, Ching-Cheng; Mai, Yifan; Mallah, Richard; Michael, Julian; Moës, Nick; Møller, Simon Geir; Nam, K. H.; Ng, TP; Nitzberg, Mark; Nushi, Besmira; Ó hÉigeartaigh, Seán; Ortega, Alejandro; Peigné, Pierre; Petrie, J. Howard; Prud'homme, Benjamin; Rabbany, Reihaneh; Sanchez-Pi, Nayat; Schwettmann, Sarah; Shlegeris, Buck; Siddiqui, Saad; Sinha, Ashish; Soto, Martín; Tan, Cheston; Dong, Ting; Tjhi, William; Trager, Robert; Tse, Brian; Tung, Anthony K. H.; Willes, John; Wong, David; Xu, Wei; Xu, Rong; Zeng, Yi; Zhang, Hao; Žikelić, Djordje (2025)
This is the first International AI Safety Report. Following an interim publication in May 2024, a diverse group of 96 Artificial Intelligence (AI) experts contributed to this first full report, including an international Expert Advisory Panel nominated by 30 countries, the Organisation for Economic Co-operation and Development (OECD), the European Union (EU), and the United Nations (UN). The report aims to provide scientific information that will support informed policymaking. It does not recommend specific policies. … This report summarises the scientific evidence on the safety of general-purpose AI. The purpose of this report is to help create a shared international understanding of risks from advanced AI and how they can be mitigated. To achieve this, this report focuses on general-purpose AI – or AI that can perform a wide variety of tasks – since this type of AI has advanced particularly rapidly in recent years and has been deployed widely by technology companies for a range of consumer and business purposes. The report synthesises the state of scientific understanding of general-purpose AI, with a focus on understanding and managing its risks. Amid rapid advancements, research on general-purpose AI is currently in a time of scientific discovery, and – in many cases – is not yet settled science. The report provides a snapshot of the current scientific understanding of general-purpose AI and its risks. This includes identifying areas of scientific consensus and areas where there are different views or gaps in the current scientific understanding. People around the world will only be able to fully enjoy the potential benefits of general-purpose AI safely if its risks are appropriately managed. This report focuses on identifying those risks and evaluating technical methods for assessing and mitigating them, including ways that general-purpose AI itself can be used to mitigate risks.
Plan and Design: designing the AI system, defining requirements, and planning development.
Governance Actor: a regulator, standards body, or oversight entity shaping AI policy.
Measure: quantifying, testing, and monitoring identified AI risks.