Techniques to remove, bound, or modify learned model capabilities post-training.
Capability limitation mitigations aim to prevent models from possessing knowledge or abilities that could enable harm. These methods alter the model's weights or training process, so that it cannot assist with harmful actions when prompted by humans or autonomously pursue harmful objectives. However, the effectiveness of these mitigations is an active area of research, and they can currently be circumvented if dual-use knowledge (knowledge that has both benign and harmful applications) is added in the context window during inference or fine-tuning.
2.1 Data Filtering
Data filtering involves removing content from training datasets that could lead to dual-use or potentially harmful capabilities. Developers can use several methods: automated classifiers to identify and remove content related to weapons development, detailed attack methodologies, or other high-risk domains; keyword-based filters to exclude documents containing specific terminology or instructions of concern; and machine learning models trained to recognize subtle patterns in content that might contribute to dangerous capabilities.
2.2 Exploratory Methods
Beyond data filtering, researchers are investigating additional capability limitation approaches. However, these methods face technical challenges, and their effectiveness remains uncertain.
● Model distillation could create specialized versions of frontier models with capabilities limited to specific domains. For example, a model could excel at medical diagnosis while lacking the knowledge needed for biological weapons development. While these capability limitations may be more fundamental than post-hoc safety training, it remains unclear how effectively this approach prevents harmful capabilities from being reconstructed. Additionally, multiple specialized models would be needed to cover various use cases, increasing development and maintenance costs.
● Targeted unlearning attempts to remove specific dangerous capabilities from models after initial training, offering a more precise alternative to full retraining. Possible approaches include fine-tuning on datasets that overwrite specific knowledge while preserving general capabilities, or modifying how models internally structure and access particular information. However, these methods may be reversible with relatively modest effort: targeted fine-tuning with small datasets can restore "unlearned" capabilities, and models may regenerate removed knowledge by inferring it from adjacent information that remains accessible.
While research continues on these approaches, developers currently rely more heavily on post-deployment mitigations that can be more reliably implemented and assessed.
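The keyword-based filtering described in Section 2.1 can be sketched minimally as follows. The patterns, function names, and corpus-partitioning helper here are illustrative assumptions, not a real blocklist or any developer's production pipeline; real systems combine such filters with trained classifiers.

```python
import re

# Hypothetical keyword-based pretraining-data filter (illustrative patterns only).
BLOCK_PATTERNS = [
    re.compile(r"\bnerve agent synthesis\b", re.IGNORECASE),
    re.compile(r"\bexploit payload\b", re.IGNORECASE),
]

def flag_document(text: str) -> bool:
    """Return True if the document matches any high-risk pattern."""
    return any(p.search(text) for p in BLOCK_PATTERNS)

def filter_corpus(docs: list[str]) -> tuple[list[str], list[str]]:
    """Partition a corpus into (kept, removed) document lists."""
    kept, removed = [], []
    for doc in docs:
        (removed if flag_document(doc) else kept).append(doc)
    return kept, removed
```

A real deployment would layer a learned classifier over such keyword matching, since keyword lists alone miss paraphrases and over-block benign discussion of the same terms.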
Behavioral Alignment Mitigations
Behavioral alignment mitigations shape how models respond to human requests and make autonomous decisions, aiming to prevent dangerous capabilities from being elicited or expressed while maintaining helpful behavior. These methods focus on training models to refuse inappropriate requests and maintain aligned goals during autonomous operation.
3.1 Current Alignment Methods
Pre-trained models possess broad capabilities but lack built-in safety protocols, potentially generating harmful outputs or failing to follow instructions. "Post-training" processes steer model behavior to achieve instruction adherence, policy compliance, and other desirable response properties. Common post-training methods include:
● Supervised Fine-Tuning (SFT): Developers curate datasets of desired model behaviors and fine-tune models to match these examples. Datasets include refusal examples (such as declining to provide bomb-making instructions) and helpful yet harmless responses. SFT directly teaches models specific behavioral patterns through imitation learning.
● Reinforcement Learning from Human Feedback (RLHF): Developers collect human preferences between pairs of model outputs and use them as a reward signal. They then use reinforcement learning to optimize models against that signal.
● Reinforcement Learning from AI Feedback (RLAIF): Methods such as Anthropic's Constitutional AI and OpenAI's deliberative alignment use AI systems to generate training feedback based on predefined principles or constitutions (such as OpenAI's Model Spec). This broader category of AI-assisted feedback has gained traction as a scalable alternative to purely human-generated feedback.
Modern alignment training pipelines typically combine these methods. These approaches can also steer the model toward prioritizing certain types of requests over others (such as system-level safety specifications over user instructions). Reward signals for agentic AI systems could also consist of more complex, real-world tasks, such as writing safe software code.
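The human-preference signal in RLHF is commonly turned into a trainable objective via a Bradley-Terry model over reward differences. A minimal numeric sketch of that objective follows; the scalar rewards here are illustrative stand-ins for the outputs of a learned reward model.

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood that the human-preferred
    response outranks the rejected one:
        loss = -log(sigmoid(r_chosen - r_rejected))
    Minimizing this pushes the reward model to score preferred
    responses higher than rejected ones."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With no margin the loss is log 2, and it shrinks as the reward model learns to separate preferred from rejected responses; the policy is then optimized against the trained reward model via reinforcement learning.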
3.2 Promising Research Directions
The following research directions show promise for improving behavioral alignment mitigations and/or our understanding of their effectiveness:
● Adversarial Training: Developers can improve processes to search for failure modes through human and automated red-teaming methods and analysis of real-world jailbreaks, and then incorporate fixes into training. While this can create an ongoing cat-and-mouse dynamic and may fail to address certain novel attacks, it acts as a baseline defense.
● Fine-tuning Security Controls: When offering fine-tuning capabilities, developers could screen users or use classifiers to detect datasets with harmful content patterns and/or suspicious instructions designed to compromise safety guardrails. These systems could be configured to filter problematic data, flag borderline cases for human review, or reject entire datasets that appear malicious. Developers could also conduct comparative assessments before and after fine-tuning to detect degradation in safety properties or suspicious capability enhancements.
● Scalable Oversight: Developing methods where AI systems help evaluate other AI systems could leverage the principle that evaluation is typically easier than generation. Possible research directions in this area include recursive reward modeling (using AI assistance to evaluate increasingly complex behaviors) and debate (having models argue different positions to expose flaws). These approaches aim to maintain meaningful human oversight even as models exceed human capabilities in specific domains.
● Mechanistic Interpretability-Guided Interventions: Developing a mechanistic understanding of model internals could enable more targeted safety modifications.
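The fine-tuning dataset screening described above (filter clear violations, flag borderline cases for review, keep the rest) could be sketched as follows. The term list, scoring heuristic, and thresholds are illustrative stand-ins for a trained harm classifier, not any provider's actual screening logic.

```python
# Hypothetical screening pipeline for user-supplied fine-tuning data.
SUSPICIOUS_TERMS = [
    "ignore previous instructions",
    "bypass safety",
    "disable guardrails",
]
REJECT_THRESHOLD = 0.6   # assumed cutoffs, tuned per deployment
REVIEW_THRESHOLD = 0.3

def harm_score(example: str) -> float:
    """Fraction of suspicious terms present (stand-in for a classifier)."""
    text = example.lower()
    return sum(term in text for term in SUSPICIOUS_TERMS) / len(SUSPICIOUS_TERMS)

def screen_dataset(examples: list[str]) -> tuple[list[str], list[str], list[str]]:
    """Partition examples into (kept, needs human review, rejected)."""
    kept, review, rejected = [], [], []
    for ex in examples:
        score = harm_score(ex)
        if score >= REJECT_THRESHOLD:
            rejected.append(ex)
        elif score >= REVIEW_THRESHOLD:
            review.append(ex)
        else:
            kept.append(ex)
    return kept, review, rejected
```

The same scoring hook could feed the comparative before/after-fine-tuning assessments mentioned above, by tracking how many held-out probes a fine-tuned model newly answers unsafely.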
While progress in these areas could significantly improve model safety, the relatively nascent state of this research and uncertainty about its effectiveness mean developers may use complementary approaches, such as detection and intervention and access controls, as discussed in later sections.
Frontier Mitigations
Frontier Model Forum (2025)
Frontier mitigations are protective measures implemented on frontier models, with the goal of reducing the risk of potential high-severity harms, especially those related to national security and public safety, that could arise from their advanced capabilities. This report discusses emerging industry practices for implementing and assessing frontier mitigations. It focuses on mitigations for managing risks in three primary domains: chemical, biological, radiological and nuclear (CBRN) information threats; advanced cyber threats; and advanced autonomous behavior threats. Given the nascent state of frontier mitigations, this report describes the range of controls and mitigation strategies being employed or researched by Frontier Model Forum members and documents the known limitations of these approaches.
Build and Use Model: Training, fine-tuning, and integrating the AI model
Developer: Entity that creates, trains, or modifies the AI system
Manage: Prioritising, responding to, and mitigating AI risks