Foundational safety research, theoretical understanding, and scientific inquiry informing AI development.
Research focusing on ensuring safe multi-agent interactions, such as by detecting and preventing malicious collective behaviors, studying how transparency can affect agent interactions, and developing evaluations for agent behavior and interaction.
Reasoning
Foundational research investigating safe multi-agent interactions and collective behaviors, and developing methods to evaluate them.
Safety and emergent functionality in multi-agent interactions
Understanding how individual agent dispositions and capabilities scale into complex multi-agent dynamics, evaluating emergent functionalities (e.g., coordinated strategies), enhancing the robustness of LLM agents to correlated failures stemming from shared foundation models, and applying insights from multi-agent RL research to LLM-based systems.
2.4.1 Research & Foundations
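A minimal Monte Carlo sketch of the correlated-failure concern, with all failure rates as illustrative assumptions rather than measurements: when many agents inherit a flaw from one shared base model, a majority of them failing at once is roughly as likely as the shared flaw itself, whereas independently built agents with the same marginal failure rate almost never fail together.

```python
import random

def simulate(n_agents: int, n_trials: int, p_shared: float, p_local: float) -> float:
    """Estimate the probability that a majority of agents fail at once.

    Each trial draws one shared failure event (e.g., an exploit that works
    against the common base model) plus independent per-agent failures.
    """
    majority_failures = 0
    for _ in range(n_trials):
        shared = random.random() < p_shared  # base-model-wide failure
        fails = sum(1 for _ in range(n_agents)
                    if shared or random.random() < p_local)
        if fails > n_agents // 2:
            majority_failures += 1
    return majority_failures / n_trials

random.seed(0)
# Correlated: nine agents share a base model with a 2% exploitable flaw.
print("shared base model :", simulate(9, 100_000, p_shared=0.02, p_local=0.01))
# Independent: the same overall failure rate, but no shared component.
print("independent models:", simulate(9, 100_000, p_shared=0.0, p_local=0.03))
```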
Detecting and preventing collusion and emergent collective behavior
Developing detection techniques (e.g., information-theoretic or interpretability-based) for collusion between AI agents, benchmarking and evaluating collusive tendencies, designing mitigation strategies such as oversight regimes, communication restrictions, and methods for steering agents, understanding conditions (e.g., agent similarity, communication channels, environment structure) that facilitate collusion, and understanding why and how general “super-agents” might develop from many narrow agents.
2.4.1 Research & Foundations
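As one hedged illustration of the information-theoretic angle, the sketch below estimates the mutual information between two agents' action streams: coordinated agents leave a statistical signature that independent agents do not. The 90% copying rate and the plug-in estimator are illustrative choices, not a proposed detector.

```python
import math
import random
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired samples."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

random.seed(1)
honest_a = [random.choice("AB") for _ in range(5000)]
honest_b = [random.choice("AB") for _ in range(5000)]
# Hypothetical colluders: agent B copies agent A's action 90% of the time.
collude_a = [random.choice("AB") for _ in range(5000)]
collude_b = [a if random.random() < 0.9 else random.choice("AB")
             for a in collude_a]

print(f"independent agents: {mutual_information(honest_a, honest_b):.3f} bits")
print(f"colluding agents  : {mutual_information(collude_a, collude_b):.3f} bits")
```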
Multi-agent security
Assessing unique security risks that arise in multi-agent ecosystems, designing defenses (e.g., secure communication protocols, improved network architectures, information security), studying how multiple systems can circumvent safeguards, evaluating robustness of cooperation to adversarial attacks (e.g., whether a small number of malicious agents can destabilize larger groups), evaluating how well agents can adversarially attack each other, and studying how agents training on data generated by one another affects shared vulnerabilities and correlated failure modes.
2.2 Risk & Assurance
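The destabilization question can be posed in a deliberately tiny setting (toy values, not a real protocol): a few coordinated malicious agents swing a mean-based aggregate arbitrarily far, while a median-based rule resists them until they approach a majority.

```python
import random
import statistics

def aggregate(n_honest: int, n_malicious: int, rule) -> float:
    """Aggregate agent reports of a true value (1.0) under a chosen rule.

    Honest agents report the truth plus small noise; malicious agents all
    push the same extreme value -- a stand-in for a coordinated attack.
    """
    reports = [1.0 + random.gauss(0, 0.05) for _ in range(n_honest)]
    reports += [100.0] * n_malicious
    return rule(reports)

random.seed(2)
for k in (0, 1, 2):
    mean = aggregate(10, k, statistics.mean)
    median = aggregate(10, k, statistics.median)
    print(f"{k} malicious of {10 + k}: mean={mean:7.2f}  median={median:.2f}")
```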
Network effects and destabilizing dynamics in agent ecosystems
Understanding which network structures and interaction patterns lead to robust or fragile systems, monitoring and controlling dynamics and co-adaptation of networks of advanced agents, and identifying important security concerns in existing and future multi-agent application areas (e.g., finance, energy grids) and applying lessons from those areas to manage destabilizing forces.
2.4.1 Research & Foundations
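A toy threshold-cascade model (the topologies and the failure threshold are illustrative, not drawn from any real agent network) shows how structure alone can separate fragile ecosystems from robust ones: a hub-and-spoke network collapses entirely when its hub fails, while the same single failure in a ring stays contained.

```python
def cascade_size(adj, seed_node, threshold=0.6):
    """Simple threshold cascade: a node fails once at least `threshold`
    of its neighbours have failed. Returns the failed fraction."""
    failed = {seed_node}
    changed = True
    while changed:
        changed = False
        for node, nbrs in adj.items():
            if node not in failed and nbrs:
                if sum(n in failed for n in nbrs) / len(nbrs) >= threshold:
                    failed.add(node)
                    changed = True
    return len(failed) / len(adj)

n = 20
# Hub-and-spoke: every peripheral agent depends on one central service.
hub = {i: [0] for i in range(1, n)}
hub[0] = list(range(1, n))
# Ring: each agent depends only on its two neighbours.
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

print("hub, central node fails:", cascade_size(hub, 0))   # entire network
print("ring, one node fails   :", cascade_size(ring, 0))  # stays local
```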
Transparency, information asymmetries, and communication protocols
Studying how agent transparency (e.g., code access) or predictability of agents can influence cooperation or defection, scaling Bayesian persuasion and information design to complex multi-agent settings, developing secure information transmission methods between AI agents to promote cooperation, examining how agent similarity and evidential reasoning about others affect the ability and propensity to cooperate, and developing efficient algorithms for zero- or few-shot coordination in high-stakes scenarios.
2.4.1 Research & Foundations
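The program-equilibrium intuition behind transparency research fits in a few lines. In the sketch below, a policy inspects its opponent's identity (a crude stand-in for reading its source code) and cooperates exactly when it recognizes itself; the payoffs are the standard prisoner's dilemma, and everything else is an illustrative simplification.

```python
# Policies receive the opponent policy itself -- a toy model of settings
# where agents can inspect one another's code before acting.

def defect_bot(opponent):
    return "D"

def mirror_bot(opponent):
    # Cooperate exactly when the opponent runs this same policy.
    return "C" if opponent is mirror_bot else "D"

PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play(a, b):
    moves = (a(b), b(a))
    return moves, PAYOFFS[moves]

print("mirror vs mirror:", play(mirror_bot, mirror_bot))  # mutual cooperation
print("mirror vs defect:", play(mirror_bot, defect_bot))  # mutual defection
```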
Multi-agent metrics and evaluations
Distinguishing and measuring cooperative dispositions, understanding agents’ robustness against coercion or exploitation, quantifying traits like altruism or spite, assessing the impact of capability asymmetries between agents, examining how training processes and data sources influence cooperation, and developing dangerous capability evaluations for multi-agent systems.
2.2.2 Testing & Evaluation
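As a hedged example of separating disposition from robustness, the sketch below runs a hypothetical "generous" agent against tit-for-tat and always-defect in an iterated prisoner's dilemma, reporting both its cooperation rate and how often it is exploited; the agent and both metrics are illustrative stand-ins, not proposed evaluations.

```python
import random

def tit_for_tat(history):
    return history[-1] if history else "C"

def always_defect(history):
    return "D"

def generous(history):
    # Hypothetical agent under test: cooperates, retaliates 50% of the time.
    if history and history[-1] == "D":
        return "D" if random.random() < 0.5 else "C"
    return "C"

def run(agent, opponent, rounds=1000):
    """Return (cooperation rate, rate of being exploited) for `agent`."""
    a_hist, o_hist = [], []
    coop = exploited = 0
    for _ in range(rounds):
        a, o = agent(o_hist), opponent(a_hist)  # simultaneous moves
        a_hist.append(a)
        o_hist.append(o)
        coop += a == "C"
        exploited += a == "C" and o == "D"  # sucker outcomes
    return coop / rounds, exploited / rounds

random.seed(3)
for opp in (tit_for_tat, always_defect):
    c, e = run(generous, opp)
    print(f"vs {opp.__name__:13s}: cooperation={c:.2f}  exploited={e:.2f}")
```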
Theoretical foundations and provable safety in AI systems
Advancing the theoretical foundations of AI safety by building models and frameworks that ensure provably correct and robust behavior. These efforts span from verifiable architectures and formal verification methods to embedded agency, decision theory, incentive structures aligned with causal reasoning, and control theory.
2.4.1 Research & Foundations
Theoretical foundations and provable safety in AI systems > Building verifiable and robust AI architectures
Constructing AI systems with architectures that support formal verification and robustness guarantees, such as world models that enable safe and reliable planning, or guaranteed safe AI with Bayesian oracles. This area emphasizes simplicity and transparency to aid in provability.
1.1.4 Model Architecture
Theoretical foundations and provable safety in AI systems > Formal verification of AI systems
Applying formal methods to verify that AI models and algorithms meet stringent safety, robustness, and performance criteria. This includes proving resilience against adversarial inputs and perturbations, and certifying conformance to specified safety properties under varying conditions.
2.2.2 Testing & Evaluation
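One standard concrete instance of such methods is interval bound propagation. The sketch below derives a guaranteed output range for a tiny, randomly initialized ReLU network under a bounded input perturbation; the bounds are sound but deliberately loose, and the network and perturbation radius are illustrative.

```python
import numpy as np

def interval_bounds(weights, biases, lo, hi):
    """Propagate an input box [lo, hi] through affine + ReLU layers.

    Returns element-wise output bounds that hold for every input in the
    box: standard interval bound propagation (sound, not tight).
    """
    for i, (W, b) in enumerate(zip(weights, biases)):
        pos, neg = np.maximum(W, 0), np.minimum(W, 0)
        lo, hi = pos @ lo + neg @ hi + b, pos @ hi + neg @ lo + b
        if i < len(weights) - 1:          # ReLU on hidden layers only
            lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)
    return lo, hi

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

x, eps = np.array([0.5, -0.3]), 0.05
lo, hi = interval_bounds([W1, W2], [b1, b2], x - eps, x + eps)
print(f"certified output range under an L-inf perturbation of {eps}: "
      f"[{lo[0]:.3f}, {hi[0]:.3f}]")
```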
Theoretical foundations and provable safety in AI systems > Decision theory and rational agency
Establishing formal decision-making frameworks that ensure rational and safe choices by AI agents, potentially drawing on concepts like causal and evidential decision theory.
2.4.1 Research & Foundations
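The causal/evidential contrast can be made concrete with Newcomb's problem. In the sketch below, an evidential reasoner treats its own action as evidence about the predictor's earlier move and one-boxes, while a causal reasoner holds the prediction fixed and two-boxes; the 99% predictor accuracy is an illustrative assumption.

```python
# Newcomb's problem: a predictor fills the opaque box with $1,000,000 only
# if it predicted you will one-box; the transparent box always holds $1,000.
ACCURACY = 0.99  # predictor reliability (illustrative)

def edt_value(action):
    """Evidential: condition on the action as evidence about the prediction."""
    p_million = ACCURACY if action == "one-box" else 1 - ACCURACY
    return 1_000_000 * p_million + (1_000 if action == "two-box" else 0)

def cdt_value(action, p_box_full):
    """Causal: the prediction is already fixed; acting cannot change it."""
    return 1_000_000 * p_box_full + (1_000 if action == "two-box" else 0)

print("EDT recommends:", max(("one-box", "two-box"), key=edt_value))
# For CDT, two-boxing dominates whatever the box already contains:
for p in (0.0, 0.5, 1.0):
    best = max(("one-box", "two-box"), key=lambda a: cdt_value(a, p))
    print(f"CDT recommends (P(box full) = {p}):", best)
```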
Theoretical foundations and provable safety in AI systems > Embedded agency
Exploring how agents can model and reason about themselves and their environment as interconnected parts of a single system, addressing challenges like self-reference, resource constraints, and the stability of reasoning processes. This includes tackling problems arising from the lack of a clear boundary between the agent and its environment.
2.4.1 Research & Foundations
Theoretical foundations and provable safety in AI systems > Causal incentives
Developing frameworks that formalize how to align agent incentives with safe and desired outcomes by ensuring their causal understanding matches intended objectives. This research provides a formal language for guaranteeing safety, addressing challenges like goal misspecification, and complementing broader efforts in agent foundations and robust system design.
2.4.1 Research & Foundations
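The sketch below is a drastically simplified, reachability-based version of one idea from this literature (the actual criteria, developed for causal influence diagrams, involve d-separation and information links, not just directed paths): a variable can carry a control incentive only if the agent's decision can causally affect it and it in turn affects the agent's utility. The recommender-system graph is hypothetical.

```python
# Simplified check: in a causal DAG, variable X can carry a control
# incentive for the agent only if decision D can reach X and X can
# reach utility U along directed paths. (Real criteria are subtler.)

def has_path(dag, src, dst):
    """Depth-first search for a directed path src -> ... -> dst."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(dag.get(node, []))
    return False

def control_incentive_possible(dag, decision, utility, var):
    return has_path(dag, decision, var) and has_path(dag, var, utility)

# Hypothetical recommender graph: the policy D chooses content, content
# shapes user opinion, and both drive the click-based utility U.
dag = {"D": ["content"], "content": ["opinion", "U"],
       "opinion": ["U"], "mood": ["opinion"]}

for var in ("opinion", "mood"):
    print(var, "->", control_incentive_possible(dag, "D", "U", var))
# opinion -> True : the agent could gain by manipulating user opinion.
# mood    -> False: outside the decision's causal reach, so no incentive.
```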
Expert Survey: AI Reliability & Security Research Priorities
O'Brien, Joe; Dolan, Jeremy; Kim, Jay; Dykhuizen, Jonah; Sania, Jeba; Becker, Sebastian; Kraprayoon, Jam; Labrador, Cara (2025)
Our survey of 53 specialists across 105 AI reliability and security research areas identifies the most promising research prospects to guide strategic AI R&D investment. As companies are seeking to develop AI systems with broadly human-level capabilities, research on reliability and security is urgently needed to ensure AI's benefits can be safely and broadly realized and prevent severe harms. This study is the first to quantify expert priorities across a comprehensive taxonomy of AI safety and security research directions and to produce a data-driven ranking of their potential impact. These rankings may support evidence-based decisions about how to effectively deploy resources toward AI reliability and security research.
Other (outside lifecycle): Outside the standard AI system lifecycle
Developer: Entity that creates, trains, or modifies the AI system
Measure: Quantifying, testing, and monitoring identified AI risks