Model
Changes to the model's learned parameters, architecture, or training process, including modifications to training data that affect what the model learns.
Developing reliable training and fine-tuning strategies for AI models to ensure that their outputs remain safe, interpretable, and aligned with intended goals. This involves understanding how fine-tuning affects model behavior, employing adversarial training for robust alignment, carefully adjusting pre-training processes, and improving data quality and auditing methods.
Reasoning
Adversarial training and fine-tuning strategies shape model behavior through the learning objectives and optimization targets used for alignment.
Understanding how fine-tuning changes a pretrained model
Investigating how fine-tuning alters a model’s internal representations and behaviors to better predict, and ultimately control, downstream safety outcomes.
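One concrete way to carry out this kind of analysis, sketched below purely as an illustration, is to compare per-layer activations of the base and fine-tuned models on the same probe inputs with linear centered kernel alignment (CKA); the random matrices stand in for real activations, and the shapes are arbitrary assumptions.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices (examples x features).

    Values near 1 mean the two representations share very similar geometry;
    a drop after fine-tuning flags the layers that changed most.
    """
    X = X - X.mean(axis=0, keepdims=True)  # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    numerator = np.linalg.norm(Y.T @ X, "fro") ** 2
    denominator = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return numerator / denominator

# Toy stand-ins for activations of a base and a fine-tuned model on the same
# probe set (hypothetical shapes: 256 examples, 512 hidden features).
rng = np.random.default_rng(0)
base_acts = rng.normal(size=(256, 512))
finetuned_acts = base_acts + 0.1 * rng.normal(size=(256, 512))  # mild drift

print(f"CKA(base, fine-tuned) = {linear_cka(base_acts, finetuned_acts):.3f}")
```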
2.4.1 Research & Foundations

Develop output-based adversarial training techniques for more robust alignment
Developing training procedures, such as adversarial training focused on internal model representations, or ‘process supervision’, that directly optimize against adversarial examples and undesirable outputs, making models more resistant to manipulations that could lead to unsafe behaviors.
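As a minimal, hedged sketch of the basic adversarial-training loop that such techniques build on, the example below mixes the clean loss with the loss on a single-step (FGSM-style) input perturbation; the toy model, random data, and perturbation budget are placeholders rather than a recommended recipe.

```python
import torch
import torch.nn as nn

# Illustrative adversarial training: each step optimizes against both the
# clean batch and an adversarially perturbed copy of it.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
epsilon = 0.05  # perturbation budget (assumed)

for step in range(100):
    x = torch.randn(16, 32)           # stand-in batch of inputs
    y = torch.randint(0, 2, (16,))    # stand-in labels

    # Build an adversarial example by ascending the loss w.r.t. the input.
    x_adv = x.clone().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).detach()

    # Optimize against both the clean and the adversarial batch.
    opt.zero_grad()
    loss = loss_fn(model(x), y) + loss_fn(model(x_adv), y)
    loss.backward()
    opt.step()
```

The same pattern extends to stronger multi-step attacks or to perturbations applied to internal representations rather than raw inputs, which is closer to the representation-focused variants mentioned above.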
1.1.2 Learning Objectives

Scalable techniques for targeted modifications of LLM behavior (including unlearning)
Creating scalable methods for precisely adjusting model outputs, such as removing unwanted content or refining responses to adhere to alignment constraints without broadly degrading performance. This may also include removal of unknown or latent undesirable capabilities that emerge in large models.
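One simple baseline for such targeted modification, sketched here under the assumption that a small "forget" set and a larger "retain" set are available, is to ascend the loss on the data to be removed while continuing ordinary training on the data to be kept; the model, data, and weighting below are illustrative.

```python
import torch
import torch.nn as nn

# Toy unlearning baseline: raise the loss on a "forget" set while keeping the
# loss low on a "retain" set, so unwanted behavior is removed without broadly
# degrading performance.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
forget_weight = 0.5  # how aggressively to unlearn (assumed)

forget_x, forget_y = torch.randn(64, 32), torch.randint(0, 2, (64,))
retain_x, retain_y = torch.randn(256, 32), torch.randint(0, 2, (256,))

for step in range(200):
    opt.zero_grad()
    retain_loss = loss_fn(model(retain_x), retain_y)
    forget_loss = loss_fn(model(forget_x), forget_y)
    # Descend on retained data, ascend on data to be forgotten.
    (retain_loss - forget_weight * forget_loss).backward()
    opt.step()
```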
1.1.3 Capability Modification

Retrieval-augmented pre-training
Incorporating retrieval mechanisms during pre-training to better ground models in verified information.
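A toy sketch of the data-pipeline side of this idea: before a raw document becomes a training sequence, the most relevant passage from a verified reference corpus is retrieved and prepended so the model learns to condition on grounded context. The corpus, word-overlap retriever, and context tags below are assumptions made only for illustration.

```python
# Toy retrieval-augmented pre-training pipeline (data preparation step only).
reference_corpus = [
    "Water boils at 100 degrees Celsius at sea level.",
    "The Eiffel Tower is located in Paris, France.",
    "Photosynthesis converts light energy into chemical energy.",
]

def retrieve(document: str, corpus: list[str]) -> str:
    """Return the corpus passage with the largest word overlap (toy retriever)."""
    doc_words = set(document.lower().split())
    return max(corpus, key=lambda p: len(doc_words & set(p.lower().split())))

def build_training_sequence(document: str) -> str:
    """Prepend retrieved, verified context before the raw document text."""
    context = retrieve(document, reference_corpus)
    return f"<context> {context} </context> {document}"

print(build_training_sequence("A recipe says to boil water before adding pasta."))
```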
1.1.4 Model Architecture

Pretraining alterations to improve interpretability
Altering pre-training protocols to produce models with clearer internal representations and decision-making pathways, allowing for more effective downstream analysis and intervention.
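As one hedged example of such an alteration, the sketch below adds an L1 sparsity penalty on hidden activations during training so that fewer units are active per input, which tends to make the resulting features easier to analyze; the architecture, data, and penalty weight are arbitrary assumptions, not a specific proposal from this area.

```python
import torch
import torch.nn as nn

# Training objective augmented with an activation-sparsity term.
encoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU())
head = nn.Linear(128, 10)
opt = torch.optim.Adam([*encoder.parameters(), *head.parameters()], lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
sparsity_weight = 1e-3  # assumed trade-off between task loss and sparsity

for step in range(100):
    x = torch.randn(16, 32)
    y = torch.randint(0, 10, (16,))
    opt.zero_grad()
    hidden = encoder(x)
    task_loss = loss_fn(head(hidden), y)
    sparsity_loss = hidden.abs().mean()  # encourages sparse, legible features
    (task_loss + sparsity_weight * sparsity_loss).backward()
    opt.step()
```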
1.1.4 Model Architecture

Limiting models’ ability to perform harmful tasks
Introducing mechanisms during pre-training that proactively limit a model’s potential to learn or perform harmful tasks, constraining the model’s capability space to safer domains before downstream fine-tuning.
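A minimal sketch of one possible mechanism, assuming a per-token flagging function is available: the pre-training loss is masked on spans flagged as harmful-capability content, so the model is never optimized to reproduce them while the rest of the document trains normally. The flagging rule, vocabulary size, and tensor shapes are placeholders.

```python
import torch
import torch.nn.functional as F

VOCAB = 1000

def flag_harmful(token_ids: torch.Tensor) -> torch.Tensor:
    """Stand-in classifier: marks tokens from a hypothetical blocked ID range."""
    return (token_ids >= 900) & (token_ids < 950)

logits = torch.randn(8, 128, VOCAB, requires_grad=True)  # stand-in model outputs
targets = torch.randint(0, VOCAB, (8, 128))               # stand-in next tokens

per_token_loss = F.cross_entropy(
    logits.reshape(-1, VOCAB), targets.reshape(-1), reduction="none"
).reshape(8, 128)

keep = ~flag_harmful(targets)  # 1 where training is allowed
loss = (per_token_loss * keep).sum() / keep.sum().clamp(min=1)
loss.backward()  # gradients flow only from non-flagged tokens
```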
1.1.3 Capability Modification

Scalable data auditing, filtering, and Pretraining with Human Feedback (PHF)
Developing tools for large-scale data auditing, filtering, and training-data attribution, and for incorporating human feedback at the pre-training stage.
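A small sketch of the conditional-training flavor of PHF: each pre-training segment is scored (here by a toy blocklist standing in for human annotation or a learned reward model) and prefixed with a control token, so the model learns the distinction during pre-training and can later be conditioned on the "good" token. The token names, scorer, and threshold are illustrative assumptions.

```python
# Toy data-tagging step for conditional pre-training with feedback tokens.
GOOD, BAD = "<|good|>", "<|bad|>"
BLOCKLIST = {"stolen credit card", "how to build a weapon"}  # toy audit rule

def feedback_score(segment: str) -> float:
    """Stand-in for a reward model or human annotation pipeline."""
    return 0.0 if any(phrase in segment.lower() for phrase in BLOCKLIST) else 1.0

def tag_segment(segment: str, threshold: float = 0.5) -> str:
    """Prefix each training segment with a control token based on its score."""
    token = GOOD if feedback_score(segment) >= threshold else BAD
    return f"{token} {segment}"

corpus = [
    "The library opens at nine in the morning.",
    "Here is a list of stolen credit card numbers.",
]
training_segments = [tag_segment(s) for s in corpus]
print(training_segments)
```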
1.1 Model

Theoretical foundations and provable safety in AI systems
Advancing the theoretical foundations of AI safety by building models and frameworks that ensure provably correct and robust behavior. These efforts span from verifiable architectures and formal verification methods to embedded agency, decision theory, incentive structures aligned with causal reasoning, and control theory.
2.4.1 Research & Foundations

Theoretical foundations and provable safety in AI systems > Building verifiable and robust AI architectures
Constructing AI systems with architectures that support formal verification and robustness guarantees, such as world models that enable safe and reliable planning, or guaranteed safe AI with Bayesian oracles. This area emphasizes simplicity and transparency to aid in provability.
1.1.4 Model Architecture

Theoretical foundations and provable safety in AI systems > Formal verification of AI systems
Applying formal methods to verify that AI models and algorithms meet stringent safety, robustness, and performance criteria. This includes proving resilience against adversarial inputs and perturbations, and certifying conformance to specified safety properties under varying conditions.
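To make the flavor of such certificates concrete, the sketch below runs interval bound propagation through a tiny ReLU network: it checks whether, for every input inside a small L-infinity ball, the logit of the true class provably stays above all others, in which case the prediction is certified robust to any perturbation in that ball. The weights and perturbation budget are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

def interval_affine(lo, hi, W, b):
    """Propagate the box [lo, hi] through x @ W + b using sign-split weights."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return lo @ W_pos + hi @ W_neg + b, hi @ W_pos + lo @ W_neg + b

def certify(x, true_class, eps):
    lo, hi = x - eps, x + eps                          # input L-infinity ball
    lo, hi = interval_affine(lo, hi, W1, b1)
    lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)  # ReLU is monotone
    lo, hi = interval_affine(lo, hi, W2, b2)
    others = [c for c in range(len(lo)) if c != true_class]
    # Robust if the certified lower bound of the true logit beats every
    # certified upper bound of the other logits.
    return all(lo[true_class] > hi[c] for c in others)

x = rng.normal(size=4)
print("certified robust:", certify(x, true_class=0, eps=0.01))
```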
2.2.2 Testing & Evaluation

Theoretical foundations and provable safety in AI systems > Decision theory and rational agency
Establishing formal decision-making frameworks that ensure rational and safe choices by AI agents, potentially drawing on concepts like causal and evidential decision theory.
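One common way to state the contrast between the two, written here in standard notation as a reference point rather than a result from the survey: evidential decision theory conditions on the action as evidence, while causal decision theory evaluates the action as an intervention.

```latex
% Evidential decision theory: condition on the action as evidence.
a_{\mathrm{EDT}} = \arg\max_{a} \sum_{o} P(o \mid a)\, U(o)

% Causal decision theory: intervene on the action (Pearl's do-operator).
a_{\mathrm{CDT}} = \arg\max_{a} \sum_{o} P(o \mid \mathrm{do}(a))\, U(o)
```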
2.4.1 Research & Foundations

Theoretical foundations and provable safety in AI systems > Embedded agency
Exploring how agents can model and reason about themselves and their environment as interconnected parts of a single system, addressing challenges like self-reference, resource constraints, and the stability of reasoning processes. This includes tackling problems arising from the lack of a clear boundary between the agent and its environment.
2.4.1 Research & Foundations

Theoretical foundations and provable safety in AI systems > Causal incentives
Developing frameworks that formalize how to align agent incentives with safe and desired outcomes by ensuring their causal understanding matches intended objectives. This research provides a formal language for guaranteeing safety, addressing challenges like goal misspecification, and complementing broader efforts in agent foundations and robust system design.
2.4.1 Research & Foundations

Expert Survey: AI Reliability & Security Research Priorities
O'Brien, Joe; Dolan, Jeremy; Kim, Jay; Dykhuizen, Jonah; Sania, Jeba; Becker, Sebastian; Kraprayoon, Jam; Labrador, Cara (2025)
Our survey of 53 specialists across 105 AI reliability and security research areas identifies the most promising research prospects to guide strategic AI R&D investment. As companies are seeking to develop AI systems with broadly human-level capabilities, research on reliability and security is urgently needed to ensure AI's benefits can be safely and broadly realized and prevent severe harms. This study is the first to quantify expert priorities across a comprehensive taxonomy of AI safety and security research directions and to produce a data-driven ranking of their potential impact. These rankings may support evidence-based decisions about how to effectively deploy resources toward AI reliability and security research.
Plan and Design
Designing the AI system, defining requirements, and planning development
Developer
Entity that creates, trains, or modifies the AI system
Manage
Prioritising, responding to, and mitigating AI risks