Foundational safety research, theoretical understanding, and scientific inquiry informing AI development.
Ensuring that AI systems are understandable, trustworthy, and transparent. This involves developing tools and methods to interpret model internals, refining the reliability and scalability of interpretability techniques, exploring ways to elicit and explain model reasoning, and improving the transparency of complex models.
Reasoning
Develops foundational interpretability research and methods to understand model internals and reasoning mechanisms.
Interpretability foundations
Focuses on theoretical and experimental studies investigating how models represent and encode concepts, emphasizing structural and abstraction-level insights, including by distinguishing linear from non-linear encodings, understanding polysemanticity and superposition, examining concept mismatches between models and humans, and discovering more accurate abstractions for interpretability.
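The distinction between linear and non-linear encodings can be made concrete with a probe. The sketch below (our own illustration, using synthetic activations rather than a real model) tests whether a concept is linearly encoded by fitting a simple difference-of-means direction and checking whether it separates examples where the concept is present from those where it is absent; all names and parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                        # hypothetical activation dimension
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)

def fake_activations(n, has_concept):
    """Synthetic activations: Gaussian noise plus a shift along concept_dir."""
    base = rng.normal(size=(n, d))
    return base + (2.0 if has_concept else -2.0) * concept_dir

pos = fake_activations(200, True)
neg = fake_activations(200, False)

# Difference-of-means probe: one of the simplest linear probes.
probe = pos.mean(axis=0) - neg.mean(axis=0)
probe /= np.linalg.norm(probe)

# Held-out accuracy: high accuracy is evidence of a linear encoding.
test_pos = fake_activations(100, True) @ probe > 0
test_neg = fake_activations(100, False) @ probe > 0
acc = (test_pos.mean() + (1 - test_neg.mean())) / 2
```

If no linear probe achieves high accuracy while a non-linear classifier does, that is one operational sign of a non-linear encoding — the kind of structural question this research area studies.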
2.4.1 Research & Foundations
Validating and applying interpretability methods
Developing rigorous criteria and benchmarks for evaluating the reliability of interpretability methods, and understanding whether these methods maintain their validity when applied to actively modify model behavior.
3.2.1 Benchmarks & Evaluation
Feature and circuit analysis
Creating scalable approaches for feature interpretation, circuit discovery, and feature steering (e.g., via top-down control vectors).
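The core mechanic of control-vector steering is simple: add a scaled feature direction to a hidden activation at inference time. This is a minimal sketch under stated assumptions — the hidden state and feature direction are random stand-ins, not vectors from a real model, and all names are hypothetical.

```python
import numpy as np

def steer(hidden_state, control_vector, alpha=4.0):
    """Shift an activation along a unit feature direction with strength alpha."""
    unit = control_vector / np.linalg.norm(control_vector)
    return hidden_state + alpha * unit

rng = np.random.default_rng(1)
h = rng.normal(size=16)   # stand-in for one residual-stream activation
v = rng.normal(size=16)   # stand-in for a learned feature direction

h_steered = steer(h, v, alpha=4.0)
# The steered state's projection onto the feature grows by exactly alpha.
delta = (h_steered - h) @ (v / np.linalg.norm(v))
```

In practice the feature direction comes from an interpretability method (e.g., a probe or dictionary feature), and the scaling constant trades off steering strength against degradation of other behavior — which is why scalable, validated feature interpretation matters here.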
2.4.1 Research & Foundations
Eliciting Latent Knowledge (ELK)
Developing methods to reveal hidden knowledge embedded within models, enabling researchers to identify what models implicitly “know” about the world and how this knowledge influences predictions.
2.4.1 Research & Foundations
Developmental interpretability
Investigating how AI models' internal representations and behaviors evolve throughout their training process to understand the developmental stages and mechanisms by which complex capabilities emerge. This research aims to uncover the progressive changes in model structure and function, facilitating better alignment and safety assurances.
2.4.1 Research & Foundations
Transparency
Research here aims to open up the black box of AI systems by uncovering how data, architecture, and training processes shape model outputs.
2.4.1 Research & Foundations
Explainability
Methods to understand why a model generates specific outputs. Technical approaches include developing post-hoc or embedded explanation methods, measuring and improving explanation fidelity, and crafting user-focused interfaces that clarify causal or logical relationships.
1.1.4 Model Architecture
Theoretical foundations and provable safety in AI systems
Advancing the theoretical foundations of AI safety by building models and frameworks that ensure provably correct and robust behavior. These efforts span from verifiable architectures and formal verification methods to embedded agency, decision theory, incentive structures aligned with causal reasoning, and control theory.
2.4.1 Research & Foundations
Theoretical foundations and provable safety in AI systems > Building verifiable and robust AI architectures
Constructing AI systems with architectures that support formal verification and robustness guarantees, such as world models that enable safe and reliable planning, or guaranteed safe AI with Bayesian oracles. This area emphasizes simplicity and transparency to aid in provability.
1.1.4 Model Architecture
Theoretical foundations and provable safety in AI systems > Formal verification of AI systems
Applying formal methods to verify that AI models and algorithms meet stringent safety, robustness, and performance criteria. This includes proving resilience against adversarial inputs and perturbations, and certifying conformance to specified safety properties under varying conditions.
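One concrete formal-verification technique for proving resilience to input perturbations is interval bound propagation (IBP): propagate an input interval through the network to obtain guaranteed output bounds for every input in a perturbation ball. The sketch below uses a tiny hypothetical two-layer ReLU network of our own construction, not one from the survey.

```python
import numpy as np

def interval_linear(lo, hi, W, b):
    """Exact interval bounds of W @ x + b for x in the box [lo, hi]."""
    W_pos, W_neg = np.clip(W, 0, None), np.clip(W, None, 0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

def interval_relu(lo, hi):
    """ReLU is monotone, so it maps interval bounds elementwise."""
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)

# Tiny hypothetical 2-layer ReLU network.
W1 = np.array([[1.0, -1.0], [0.5, 2.0]]); b1 = np.zeros(2)
W2 = np.array([[1.0, 1.0]]);              b2 = np.array([0.0])

x = np.array([1.0, 0.5]); eps = 0.1   # input point and perturbation radius
lo, hi = interval_linear(x - eps, x + eps, W1, b1)
lo, hi = interval_relu(lo, hi)
lo, hi = interval_linear(lo, hi, W2, b2)
# If lo > 0, the output's sign is certified for ALL inputs in the eps-ball.
```

The bounds are sound but not tight; much of the research in this area is about tightening such relaxations so that certification scales to realistic models.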
2.2.2 Testing & Evaluation
Theoretical foundations and provable safety in AI systems > Decision theory and rational agency
Establishing formal decision-making frameworks that ensure rational and safe choices by AI agents, potentially drawing on concepts like causal and evidential decision theory.
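To see why the choice of decision theory matters, consider Newcomb's problem, the classic case where evidential and causal decision theory disagree. A predictor fills an opaque box with $1M only if it predicts the agent will take just that box; a transparent box always holds $1k. The worked example below is our own illustration with an assumed predictor accuracy, not a formalism taken from the survey.

```python
P_CORRECT = 0.99          # assumed predictor accuracy (hypothetical)

def edt_value(action):
    """EDT conditions on the action as evidence about the prediction."""
    if action == "one-box":
        return P_CORRECT * 1_000_000
    return 1_000 + (1 - P_CORRECT) * 1_000_000

def cdt_value(action, p_box_filled):
    """CDT holds the already-made prediction fixed when choosing."""
    base = p_box_filled * 1_000_000
    return base if action == "one-box" else base + 1_000

edt_pick = max(["one-box", "two-box"], key=edt_value)
cdt_pick = max(["one-box", "two-box"],
               key=lambda a: cdt_value(a, p_box_filled=0.5))
```

EDT one-boxes (expected $990k beats $11k) while CDT two-boxes (taking both dominates for any fixed prediction). Formal frameworks for safe AI agency must specify which of these reasoning patterns an agent should instantiate.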
2.4.1 Research & Foundations
Theoretical foundations and provable safety in AI systems > Embedded agency
Explores how agents can model and reason about themselves and their environment as interconnected parts of a single system, addressing challenges like self-reference, resource constraints, and the stability of reasoning processes. This includes tackling problems arising from the lack of a clear boundary between the agent and its environment.
2.4.1 Research & Foundations
Theoretical foundations and provable safety in AI systems > Causal incentives
Developing frameworks that formalize how to align agent incentives with safe and desired outcomes by ensuring their causal understanding matches intended objectives. This research provides a formal language for guaranteeing safety, addressing challenges like goal misspecification, and complementing broader efforts in agent foundations and robust system design.
2.4.1 Research & Foundations
Expert Survey: AI Reliability & Security Research Priorities
O'Brien, Joe; Dolan, Jeremy; Kim, Jay; Dykhuizen, Jonah; Sania, Jeba; Becker, Sebastian; Kraprayoon, Jam; Labrador, Cara (2025)
Our survey of 53 specialists across 105 AI reliability and security research areas identifies the most promising research prospects to guide strategic AI R&D investment. As companies are seeking to develop AI systems with broadly human-level capabilities, research on reliability and security is urgently needed to ensure AI's benefits can be safely and broadly realized and prevent severe harms. This study is the first to quantify expert priorities across a comprehensive taxonomy of AI safety and security research directions and to produce a data-driven ranking of their potential impact. These rankings may support evidence-based decisions about how to effectively deploy resources toward AI reliability and security research.
Other (outside lifecycle)
Outside the standard AI system lifecycle
Developer
Entity that creates, trains, or modifies the AI system
Measure
Quantifying, testing, and monitoring identified AI risks