



Single-Stakeholder Settings

Bengio (2025) | LLM classified

Mitigation Taxonomy

1 AI System

1.1 Model

1.1.2 Learning Objectives

Training methods that shape model behavior through objectives, feedback, and optimization targets.

Also in Model

1.1.1 Training Data | 1.1.3 Capability Modification | 1.1.4 Model Architecture

Definition (p. 17)

A key specification/validation challenge is to develop faithful methods to translate human oversight into automated systems: How can we design processes for developing automated AI proxies for humans based on human feedback and demonstrations?

LLM Classification Details

Reasoning

Foundational research investigates how to develop AI systems that learn from human feedback and oversight.

Code: 2.4.1 | Version: v0.5 | Classified: Jan 22, 2026

Part of

Specification & Validation: Defining the system’s purpose

Specification involves defining desired system behaviour, whereas validation ensures that the specification meets the needs of the user, developer, or society – did I build the right system? In other words, specification and validation require confronting the complexity of defining objectives in a way that captures user or societal intent without omitting important constraints or causing undesired side effects, as well as dealing with disagreement and tradeoffs between diverse stakeholders.

Sub-mitigations (4)

Avoiding reward hacking and unintended consequences

Even in a simple setting with one human’s well-defined and fixed preferences, subtle mis-specifications can yield unacceptable results if the AI system optimises rigidly for the literal specification rather than the user’s true intent.

1.1.2 Learning Objectives

Lifecycle: Build and Use Model | Actor: Developer | AIRM: Manage
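The gap between a literal specification and the user's true intent can be illustrated with a toy example. The sketch below is not taken from the source document: the cleaning scenario, the `Outcome` fields, and both scoring functions are hypothetical. It only shows how selecting the best outcome under a proxy reward can diverge from selecting the best outcome under the intended objective.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    name: str
    visible_mess: int  # what the literal specification measures
    actual_mess: int   # what the user cares about (hypothetical ground truth)


def proxy_reward(outcome: Outcome) -> int:
    # Literal specification: "minimise the mess that is visible".
    return -outcome.visible_mess


def true_utility(outcome: Outcome) -> int:
    # Assumed user intent: minimise the mess that actually exists.
    return -outcome.actual_mess


# Hypothetical outcomes an optimiser could choose between.
CANDIDATES = [
    Outcome("do_nothing", visible_mess=5, actual_mess=5),
    Outcome("clean_most", visible_mess=1, actual_mess=1),
    Outcome("cover_with_rug", visible_mess=0, actual_mess=5),
]

best_by_proxy = max(CANDIDATES, key=proxy_reward)
best_by_intent = max(CANDIDATES, key=true_utility)

print("Optimising the literal specification picks:", best_by_proxy.name)
print("Optimising the user's intent picks:", best_by_intent.name)
```

Here the rigid optimiser prefers hiding the mess, because the specification never mentions mess that is out of sight. Real cases are rarely this easy to spot, which is why the source treats this as an open research problem.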

Defining clear boundaries for acceptable behaviour

When designing frontier AI systems, it is difficult to precisely define the boundaries between acceptable and unacceptable behaviour. Many of these challenges stem from the dual-use nature of information. For example, some biology lab protocols are useful for both benign and harmful bioengineering experiments. Defining acceptable behaviours is made more difficult because some harmful tasks can be decomposed into individually benign subtasks (e.g. Jones). Effectively defining safe behavioural boundaries and ensuring that systems can learn them is an ongoing challenge that requires an extensive understanding of emerging AI misuse threats.

2.2.9 Other

Lifecycle: Plan and Design | Actor: Developer | AIRM: Govern

Scalable methods to discover specification loopholes

How can we systematically identify subtle flaws in the specification that only appear under unusual or adversarial conditions? How can we progressively reduce those ambiguities or misspecified edge cases via active learning or goal ensembling techniques? Typically, red-teaming work is associated with assessing a final system, but frontiers for future work on specification loopholes may involve developing adversarial red-teaming techniques to stress test specifications.

2.2.2 Testing & Evaluation

Lifecycle: Plan and Design | Actor: Developer | AIRM: Map
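One way to picture stress testing a specification, rather than a finished system, is to search for behaviours that satisfy the written specification while violating the intent behind it. The sketch below is an illustration under strong assumptions, not a method from the source. The sorting task, `spec_ok`, `intent_ok`, and the candidate behaviours are all hypothetical, and `intent_ok` stands in for what would usually be a costly human judgement.

```python
import random
from collections import Counter


def spec_ok(inp, out):
    # Written specification (deliberately incomplete): output is non-decreasing.
    return all(a <= b for a, b in zip(out, out[1:]))


def intent_ok(inp, out):
    # Hypothetical intent oracle: output is ordered AND a permutation of the input.
    return spec_ok(inp, out) and Counter(inp) == Counter(out)


def find_loopholes(trials=200, seed=0):
    """Randomly sample inputs and collect (input, output) pairs that
    pass the specification but fail the intent check."""
    rng = random.Random(seed)
    loopholes = []
    for _ in range(trials):
        inp = [rng.randint(0, 9) for _ in range(rng.randint(1, 5))]
        # Hypothetical behaviours an optimiser might produce.
        candidates = [sorted(inp), [], inp[:1], sorted(set(inp))]
        for out in candidates:
            if spec_ok(inp, out) and not intent_ok(inp, out):
                loopholes.append((inp, out))
    return loopholes


loopholes = find_loopholes()
print(f"Found {len(loopholes)} specification loopholes, for example: {loopholes[0]}")
```

Every loophole found here points at the same missing constraint (the output must preserve the input's elements). Tightening the specification and rerunning the search is a crude analogue of the iterative refinement the paragraph above asks about.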

Conflicting and evolving preferences

Even in a single-stakeholder setting, it is challenging to fully align with a single human’s values. Existing approaches to developing aligned AI assume a human who has fixed, stable, and consistent preferences. However, human preferences are complex, dynamic, context-dependent, and sometimes self-contradictory, making it fundamentally difficult to align systems with even a single human (Armstrong, Casper-A). Directions for continued work include designing systems that continue to learn and adapt to changes in user preferences, implementing normatively-accepted ways of resolving conflicts between preferences, as well as methods to help human users meaningfully analyse and update their preferences over time.

1.1.2 Learning Objectives

Lifecycle: Build and Use Model | Actor: Developer | AIRM: Manage
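Self-contradictory preferences can be made concrete with pairwise comparisons: if a user prefers A to B, B to C, and C to A, no single ranking satisfies all three judgements. The following minimal sketch assumes preferences arrive as `(better, worse)` pairs, which is a representation chosen here for illustration. It detects such cycles with a depth-first search. The function name and the example items are hypothetical.

```python
def find_preference_cycle(prefs):
    """Return a list of items forming a preference cycle, or None.

    prefs is an iterable of (better, worse) pairs.
    """
    graph = {}
    for better, worse in prefs:
        graph.setdefault(better, []).append(worse)
        graph.setdefault(worse, [])

    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNSEEN for node in graph}
    path = []  # nodes on the current depth-first search path

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == IN_PROGRESS:
                # Reached a node already on the current path: a cycle.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == UNSEEN:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state[node] == UNSEEN:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


consistent = [("tea", "coffee"), ("coffee", "juice")]
contradictory = consistent + [("juice", "tea")]

print(find_preference_cycle(consistent))
print(find_preference_cycle(contradictory))
```

Detecting a cycle is only the first step. Deciding how to resolve it (for example, by asking the user or by weighting recent judgements more heavily) is the normative question the paragraph above leaves open.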

Other mitigations from Bengio (2025) (76)

Risk Assessment

The primary goal of risk assessment is to understand the severity and likelihood of a potential harm. Risk assessments are used to prioritise risks and determine if they cross thresholds that demand specific action. Consequential development and deployment decisions are predicated on these assessments. The research areas in this category involve:

A. Developing methods to measure the impact of AI systems for both current and future AI – This includes developing standardised assessments for risky behaviours of AI systems through audit techniques and benchmarks, evaluation and assessment of new capabilities, including potentially dangerous ones; and for real-world societal impact such as labour, misinformation and privacy through field tests and prospective risks analysis.

B. Enhancing metrology to ensure that the measurements are precise and repeatable – This includes research in technical methods for quantitative risk assessment tailored to AI systems to reduce uncertainty and the need for large safety margins. This is an important open area of research.

C. Building enablers for third-party audits to support independent validation of risk assessments – This includes developing secure infrastructure that enables thorough evaluation while protecting intellectual property, including preventing model theft.

2.2.1 Risk Assessment

Lifecycle: Plan and Design | Actor: Governance Actor | AIRM: Measure

Risk Assessment > Audit techniques and benchmarks

Techniques and benchmarks with which AI systems can be effectively and efficiently tested for harmful behaviours are highly varied and central to risk assessments (IAISR, Birhane-A).

3.2.1 Benchmarks & Evaluation

Lifecycle: Other (outside lifecycle) | Actor: Developer | AIRM: Measure

Risk Assessment > Downstream impact assessment and forecasting

Assessing and forecasting the many societal impacts of AI systems is one of the most central goals of risk assessments.

2.2.1 Risk Assessment

Lifecycle: Plan and Design | Actor: Governance Actor | AIRM: Map

Risk Assessment > Secure evaluation infrastructure

External auditors and oversight bodies need infrastructure and protocols that enable thorough evaluation while protecting sensitive intellectual property. Ideally, evaluation infrastructure should enable double-blindness: the evaluator’s inability to directly access the system’s parameters and developers’ inability to know what exact evaluations are run (Reuel, Bucknall-A, Casper-B). Meanwhile, the importance of mutual security will continue to grow as system capabilities and risks increase. Methods for developing secure infrastructure for auditing and oversight are known to be possible.

3.2.2 Technical Standards

Lifecycle: Verify and Validate | Actor: Governance Actor | AIRM: Measure

Risk Assessment > System safety assessment

Safety assessment is not just about individual AI systems, but also their interaction with the rest of the world. For example, when an AI company discovers concerning behaviour from their system, the resulting risks depend, in part, on having internal processes in place to escalate the issue to senior leadership and work to mitigate the risks. System safety considers both AI systems and the broader context that they are deployed in. The study of system safety focuses on the interactions between different technical components as well as processes and incentives in an organisation (IAISR, Hendrycks-B, AISES, Alaga).

2.2.1 Risk Assessment

Lifecycle: Other (multiple stages) | Actor: Deployer | AIRM: Measure

Risk Assessment > Metrology for AI risk assessment

Metrology, the science of measurement, has only recently been studied in the context of AI risk assessment (IAISR, Hobbhahn). Current approaches generally lack standardisation, repeatability, and precision.

3.2.1 Benchmarks & Evaluation

Lifecycle: Other (outside lifecycle) | Actor: Other (multiple actors) | AIRM: Measure
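As a small illustration of what repeatability and precision mean for an evaluation, the sketch below summarises the spread of a benchmark score across repeated runs. The scores are made-up numbers, the function name is hypothetical, and the 95% interval uses a normal approximation, which is an assumption. With only a handful of runs, a t-based interval would be more appropriate.

```python
import statistics


def repeatability_summary(scores):
    """Summarise repeated measurements of the same benchmark score."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)      # sample standard deviation
    stderr = stdev / len(scores) ** 0.5   # standard error of the mean
    return {
        "mean": mean,
        "stdev": stdev,
        "stderr": stderr,
        # Normal-approximation interval; an assumption, see the note above.
        "approx_95_ci": (mean - 1.96 * stderr, mean + 1.96 * stderr),
    }


# Hypothetical scores from five runs of the same evaluation with different seeds.
runs = [0.70, 0.72, 0.68, 0.71, 0.69]
summary = repeatability_summary(runs)
print(summary)
```

Reporting the interval alongside the mean makes it easier to judge whether a difference between two systems, or a threshold crossing, exceeds run-to-run noise.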


Source Document

The Singapore Consensus on Global AI Safety Research Priorities

Bengio, Yoshua; Maharaj, Tegan; Ong, C.-H. Luke; Russell, Stuart D.; Song, Dawn; Tegmark, Max; Lan, Xue; Zhang, Ya-Qin; Casper, Stephen; Lee, Wan Sie; Mindermann, Sören; Wilfred, Vanessa; Balachandran, Vidhisha; Barez, Fazl; Belinsky, Michael; Bello, Imane; Bourgon, Malo; Brakel, Mark; Campos, Siméon; Cass-Beggs, Duncan; Chen, Jiahao; Chowdhury, Rumman; Seah, Kuan Chua; Clune, Jeff; Dai, Jie; Delaborde, Agnes; Dziri, Nouha; Eiras, Francisco; Engels, Joshua; Fan, Jinyu; Gleave, Adam; Goodman, Noah D.; Heide, Fynn; Heidecke, Johannes; Hendrycks, Dan; Hodes, Cyrus; Hsiang, Bryan Low Kian; Huang, Minlie; Jawhar, Sami; Wang, Jingyu; Kalai, Adam Tauman; Kamphuis, Meindert; Kankanhalli, Mohan; Kantamneni, Subhash; Kirk, M.; Kwa, Thomas; Ladish, Jeffrey; Lam, Kwok-Yan; Lee, Wan Sie; Lee, Taewhi; Li, Xiaopeng; Liu, Jiajun; Lu, Ching-Cheng; Mai, Yifan; Mallah, Richard; Michael, Julian; Moës, Nick; Møller, Simon Geir; Nam, K. H.; Ng, TP; Nitzberg, Mark; Nushi, Besmira; Ó hÉigeartaigh, Seán; Ortega, Alejandro; Peigné, Pierre; Petrie, J. Howard; Prud'homme, Benjamin; Rabbany, Reihaneh; Sanchez-Pi, Nayat; Schwettmann, Sarah; Shlegeris, Buck; Siddiqui, Saad; Sinha, Ashish; Soto, Martín; Tan, Cheston; Dong, Ting; Tjhi, William; Trager, Robert; Tse, Brian; Tung, Anthony K. H.; Willes, John; Wong, David; Xu, Wei; Xu, Rong; Zeng, Yi; Zhang, Hao; Žikelić, Djordje (2025)

This is the first International AI Safety Report. Following an interim publication in May 2024, a diverse group of 96 Artificial Intelligence (AI) experts contributed to this first full report, including an international Expert Advisory Panel nominated by 30 countries, the Organisation for Economic Co-operation and Development (OECD), the European Union (EU), and the United Nations (UN). The report aims to provide scientific information that will support informed policymaking. It does not recommend specific policies…. This report summarises the scientific evidence on the safety of general-purpose AI. The purpose of this report is to help create a shared international understanding of risks from advanced AI and how they can be mitigated. To achieve this, this report focuses on general-purpose AI – or AI that can perform a wide variety of tasks – since this type of AI has advanced particularly rapidly in recent years and has been deployed widely by technology companies for a range of consumer and business purposes. The report synthesises the state of scientific understanding of general-purpose AI, with a focus on understanding and managing its risks. Amid rapid advancements, research on general-purpose AI is currently in a time of scientific discovery, and – in many cases – is not yet settled science. The report provides a snapshot of the current scientific understanding of general-purpose AI and its risks. This includes identifying areas of scientific consensus and areas where there are different views or gaps in the current scientific understanding. People around the world will only be able to fully enjoy the potential benefits of general-purpose AI safely if its risks are appropriately managed. This report focuses on identifying those risks and evaluating technical methods for assessing and mitigating them, including ways that general-purpose AI itself can be used to mitigate risks.

DOI: 10.48550/arXiv.2506.20702

Classification

AI Lifecycle Stage

Plan and Design

Designing the AI system, defining requirements, and planning development

Responsible Actor

Developer

Entity that creates, trains, or modifies the AI system

NIST AI RMF Function

Measure

Quantifying, testing, and monitoring identified AI risks

Manage

Risk Domains

Primary

7.1 AI pursuing its own goals in conflict with human goals or values

Other

5.2 Loss of human agency and autonomy