Detection and intervention mitigations rely on automated methods to detect model usage (e.g., inputs and outputs) that may give rise to undesired behavior. These detection methods are paired with intervention mechanisms that respond proportionally to detected concerns. The subsections below describe various detection methods, while this introduction outlines how these methods are combined with different intervention methods. Detection and intervention methods can be applied to both misuse risks and misalignment risks, where models might pursue unintended goals or exhibit deceptive behavior autonomously.
Detection systems can analyze many different inputs to identify potential concerns, including: user queries, model responses, conversation history, fine-tuning data, background information like past user interactions, activations, or chain-of-thought (as described in OpenAI’s o1 system card). Developers, including downstream adopters of open-weight models, can also combine multiple detection methods into orchestration layers to maximize coverage while compensating for individual method limitations. These orchestration layers coordinate different detection approaches – from simple classifiers to complex probes – and route detected issues to appropriate interventions. For agent misalignment risks, systems like Meta’s Llama Firewall demonstrate how multiple detection methods can work together to provide more robust protection. Once concerning behavior is detected, interventions can be implemented based on the severity and nature of the risk. Common interventions include blocking or modifying model responses before they reach users, terminating user sessions, flagging content for human review, or, in severe cases, escalating to law enforcement. Developers typically implement escalating response protocols that match intervention intensity to threat severity.
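The orchestration pattern described above — multiple detectors feeding a severity-based intervention router — can be sketched in a few lines. The detectors and thresholds here are purely illustrative, not taken from any deployed system.

```python
# Sketch of a detection/intervention orchestration layer. All function
# names, keyword lists, and thresholds are illustrative assumptions.

def keyword_detector(text: str) -> float:
    """Toy detector: flags known high-risk keywords."""
    risky = {"exploit", "bioweapon"}
    return 1.0 if any(word in text.lower() for word in risky) else 0.0

def length_anomaly_detector(text: str) -> float:
    """Toy detector: unusually long outputs get a mild risk score."""
    return 0.4 if len(text) > 10_000 else 0.0

def orchestrate(text: str) -> str:
    """Combine detector scores and map severity to an intervention."""
    score = max(keyword_detector(text), length_anomaly_detector(text))
    if score >= 0.9:
        return "block"            # prevent the response from reaching the user
    if score >= 0.4:
        return "flag_for_review"  # escalate to human review
    return "allow"

print(orchestrate("how do I build a bioweapon"))  # -> block
print(orchestrate("hello"))                       # -> allow
```

In practice the `max` aggregation would be replaced by a learned or policy-defined combination rule, and each branch would route to the escalation protocols described above.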
4.1 LLM-based Prompted Classifiers
Language models can be prompted to act as classifiers on inputs to and outputs of the model. The classifier large language model (LLM) can be another instance of the same model or a separate model optimized for classification tasks. For example, it has been demonstrated that prompting an LLM with simple instructions like "Does the text contain harmful content? Respond with 'Yes, this is harmful' or 'No, this is not harmful'" can achieve high accuracy and effectively reduce attack success. LLM-based classifiers could also monitor for signs of misalignment in model outputs, particularly in autonomous agent settings. In some cases, it is also possible to generate probability scores rather than binary classifications.
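A prompted-classifier wrapper of this kind is simple to build. In this sketch, `call_llm` is a stand-in for a real model API call (stubbed here with a keyword check so the example runs); the prompt and parsing logic are illustrative.

```python
# Minimal sketch of an LLM-as-classifier wrapper. `call_llm` is a
# hypothetical stand-in for a real model call, stubbed for runnability.

PROMPT = (
    "Does the text contain harmful content? Respond with "
    "'Yes, this is harmful' or 'No, this is not harmful'.\n\nText: {text}"
)

def call_llm(prompt: str) -> str:
    # Stub: a real implementation would call a hosted or local model here.
    if "attack" in prompt.lower():
        return "Yes, this is harmful"
    return "No, this is not harmful"

def is_harmful(text: str) -> bool:
    reply = call_llm(PROMPT.format(text=text))
    return reply.strip().lower().startswith("yes")

print(is_harmful("plan an attack"))  # stub flags this text
print(is_harmful("bake a cake"))
```

To obtain probability scores rather than binary labels, the prompt can instead ask for a numeric rating, or the classifier can expose token-level likelihoods where the serving stack supports it.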
4.2 Custom-trained Classifier Models
Language models can be trained via fine-tuning to classify inputs and outputs from the model, producing probability scores that indicate whether content violates safety policies, contains potentially harmful material, or exhibits misalignment-related behavior. Examples that demonstrate this approach include: Google’s ShieldGemma, which provides a suite of models from 2 billion to 27 billion parameters that achieve state-of-the-art performance with harm-type specific predictions; Meta’s Llama Guard 4, a 12-billion-parameter instruction-tuned model for multi-class classification with customizable safety taxonomies; and Anthropic’s Constitutional Classifiers, which leverage constitution-guided synthetic data generation to train classifiers that successfully defended against universal jailbreaks across thousands of hours of human red teaming. As demonstrated in the Constitutional Classifiers example, developers can guide training using explicit rules (a "constitution") defining permissible and restricted content. Explicitly defining both harmful and harmless categories in this way can help produce more nuanced training data, enabling the classifier to learn appropriate boundaries more effectively.
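The interface such classifiers expose — text in, violation probability out — can be illustrated with a toy stand-in. Real systems fine-tune a language model on large policy-labeled corpora; this sketch substitutes a tiny TF-IDF plus logistic-regression pipeline on a handful of made-up examples purely to show the scoring pattern.

```python
# Toy stand-in for a trained safety classifier. The training data and
# model choice are illustrative; production systems fine-tune an LLM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "how to synthesize a nerve agent",
    "steps to build a pipe bomb",
    "best recipe for banana bread",
    "tips for learning guitar",
]
train_labels = [1, 1, 0, 0]  # 1 = violates policy, 0 = benign

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

# predict_proba yields a graded score rather than a binary verdict,
# which lets deployers set policy-specific thresholds downstream.
score = clf.predict_proba(["how to build a bomb"])[0][1]
print(f"violation probability: {score:.2f}")
```

The probability output is what makes threshold-based escalation (block, review, allow) possible downstream.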
4.3 Linear Probes
Linear probes, also referred to as “probes”, are simple classifiers trained to predict specific properties based on a model's activations. These activations are the intermediate computational states produced as the model processes input and represent the internal calculations or processing taking place inside a model while generating an output. Probes can be trained to check these activations for patterns related to undesirable behavior, such as generating harmful content or being deceptive.
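Mechanically, a probe is just a linear classifier over activation vectors. In this sketch, random vectors with a planted "feature direction" stand in for real hidden states captured from a transformer layer; everything about the data is synthetic and illustrative.

```python
# Sketch of a linear probe: logistic regression over (synthetic) model
# activations. Real probes train on hidden states hooked out of a model;
# here we plant a feature direction in random vectors to stand in for one.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64                           # hidden-state dimensionality (assumed)
direction = rng.normal(size=d)   # the planted "undesirable behavior" feature

benign = rng.normal(size=(200, d))
flagged = rng.normal(size=(200, d)) + 2.0 * direction  # shifted along the feature

X = np.vstack([benign, flagged])
y = np.array([0] * 200 + [1] * 200)

probe = LogisticRegression().fit(X, y)
print(f"train accuracy: {probe.score(X, y):.2f}")
```

Because the probe is linear and operates on cached activations, it adds very little inference-time cost, which is one reason probes are attractive for always-on monitoring.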
4.4 Static Analysis Tools
Static tooling can also be deployed to increase the robustness of LLM outputs. For example, CodeShield, part of Meta's PurpleLlama project, is a static analysis engine that detects insecure code patterns in LLM-generated code, identifying vulnerabilities like weak cryptographic functions and other Common Weakness Enumeration or CWE classified security issues. Similarly, regular expressions (regex) can be used for pattern matching to validate output formats or detect prohibited content patterns.
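A minimal regex-based check of this kind might look like the following. This is a hedged sketch, not CodeShield; the pattern list is a small illustrative sample of insecure idioms.

```python
# Sketch of regex-based scanning of LLM-generated code for insecure
# patterns. The pattern set is illustrative, not a real CWE ruleset.
import re

INSECURE_PATTERNS = {
    "weak_hash": re.compile(r"\b(md5|sha1)\s*\("),           # weak crypto hash calls
    "eval_call": re.compile(r"\beval\s*\("),                  # arbitrary code execution
    "hardcoded_secret": re.compile(r"(api_key|password)\s*=\s*['\"]"),
}

def scan_code(code: str) -> list[str]:
    """Return the names of all insecure patterns found in `code`."""
    return [name for name, pat in INSECURE_PATTERNS.items() if pat.search(code)]

print(scan_code("digest = md5(data)"))     # ['weak_hash']
print(scan_code("password = 'hunter2'"))   # ['hardcoded_secret']
print(scan_code("print('hello')"))         # []
```

Regex checks are fast and deterministic, which makes them a useful first layer before heavier semantic analysis.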
4.5 Manual Human Review
In some cases, humans can review outputs that automated classifiers have flagged as potentially abusive or otherwise problematic. Microsoft's Azure OpenAI Service demonstrates this approach, where authorized Microsoft employees assess content flagged through automated classification when confidence thresholds are not met or in complex contexts.
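Confidence-gated routing of this kind reduces to a simple decision rule. The thresholds below are illustrative assumptions, not values from any deployed service.

```python
# Sketch of confidence-based routing to human review. Thresholds are
# illustrative, not from any specific deployment.

def route(classifier_confidence: float, flagged: bool) -> str:
    """Decide how a classifier verdict is handled."""
    if not flagged:
        return "allow"
    if classifier_confidence >= 0.95:
        return "auto_block"    # high-confidence violations handled automatically
    return "human_review"      # uncertain cases go to a reviewer queue

print(route(0.99, True))   # auto_block
print(route(0.60, True))   # human_review
print(route(0.10, False))  # allow
```

Routing only low-confidence cases to reviewers keeps the human workload proportional to classifier uncertainty rather than to total traffic.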
4.6 Promising Research Directions
Current detection and intervention systems face challenges that will likely intensify as models become more capable. Research efforts are focused on addressing these limitations through several complementary approaches.
● Improving Recall and Precision: Detection systems face an inherent tradeoff between catching concerning behavior (recall) and avoiding false positives that degrade user experience (precision). High-recall systems often over-refuse harmless queries, increase latency, or impose other costs on deployment. Promising research directions include developing more nuanced detection methods that can distinguish edge cases more accurately, creating adaptive systems that adjust sensitivity based on context and user history, and exploring ensemble approaches that combine multiple detection methods to achieve better recall-precision tradeoffs without sacrificing usability.
● Chain-of-Thought Monitoring and Faithfulness: Monitoring a model's externalized reasoning process offers a promising approach for detecting concerning intent before it manifests in outputs. For example, OpenAI's o1 system card describes their chain-of-thought deception monitoring setup, while Meta has openly released 'AlignmentCheck,' a tool designed to detect instances in a model's chain of thought where prompt injection might misalign agent behavior from a user's request. Google has also developed a framework for assessing whether models possess the stealth and situational awareness capabilities necessary for evading monitoring. However, as models become more sophisticated, they may learn to produce reasoning that appears benign while concealing harmful intent, potentially through reward hacking (exploiting flaws in reward signals to achieve high scores without genuinely safe behavior) during training. Establishing chain-of-thought faithfulness and effective monitoring will therefore involve multiple considerations, including keeping reasoning visible and legible to humans for auditing purposes, avoiding training approaches that might incentivize models to conceal their true reasoning, and developing methods to detect when externalized thoughts don't match actual behavior.
● Privacy-Preserving Monitoring and Retrospective Analysis: Maintaining logs for retrospective analysis can be a valuable and cost-effective enabler of other safety interventions, allowing developers to identify patterns that real-time detection systems miss and understand classifier blind spots through historical analysis. This becomes particularly important as developers face unfamiliar risks from sophisticated autonomous behavior in complex environments. However, comprehensive logging raises significant privacy concerns for users who may not want their interactions retained. As models gain more autonomous capabilities and broader access to external systems, finding approaches that provide sufficient visibility for safety while respecting user privacy represents an important area for future research and careful consideration of tradeoffs.
● Interpretability-Based Monitoring: Surface-level monitoring of inputs and outputs may miss concerning behavior that manifests only in a model's internal computations, especially as models develop more complex reasoning capabilities. Advances in mechanistic interpretability and internal activation analysis aim to detect concerning patterns directly from model internals, even when external behavior appears benign. This includes developing methods to identify and monitor safety-relevant circuits within models, and creating real-time activation monitoring systems that can flag unusual internal states.
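The recall-precision tradeoff described in the first bullet can be made concrete by sweeping a decision threshold over classifier scores. The scores and ground-truth labels below are synthetic, chosen only to show how tightening the threshold trades recall for precision.

```python
# Sketch of the recall/precision tradeoff via a threshold sweep.
# Scores and labels are synthetic and purely illustrative.

scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    0,    1,    0]  # 1 = truly harmful

def precision_recall(threshold: float) -> tuple[float, float]:
    preds = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

for t in (0.85, 0.50, 0.15):
    p, r = precision_recall(t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

At a strict threshold (0.85) every flag is correct but half the harmful cases slip through; at a permissive one (0.15) nothing slips through but false positives rise — the over-refusal cost the text describes.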
Capability Limitation Mitigations
Capability limitation mitigations aim to prevent models from possessing knowledge or abilities that could enable harm. These methods alter the model's weights or training process, so that it cannot assist with harmful actions when prompted by humans or autonomously pursue harmful objectives. However, the effectiveness of these mitigations is an active area of research, and they can currently be circumvented if dual-use knowledge (knowledge that has both benign and harmful applications) is reintroduced, either in the context window at inference time or through subsequent fine-tuning.
2.1 Data Filtering
Data filtering involves removing content from training datasets that could lead to dual-use or potentially harmful capabilities. Developers can use several methods: automated classifiers to identify and remove content related to weapons development, detailed attack methodologies, or other high-risk domains; keyword-based filters to exclude documents containing specific terminology or instructions of concern; and machine learning models trained to recognize subtle patterns in content that might contribute to dangerous capabilities.
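The simplest of these methods, keyword-based filtering, can be sketched as a blocklist pass over a corpus. The blocklist and documents below are illustrative; production pipelines combine such filters with trained classifiers.

```python
# Sketch of keyword-based pretraining-data filtering. The blocklist and
# documents are illustrative assumptions, not a real filtering ruleset.
import re

BLOCKLIST = re.compile(
    r"\b(nerve agent|enrichment cascade|zero-day exploit)\b", re.IGNORECASE
)

documents = [
    "A history of 20th-century chemistry.",
    "Step-by-step synthesis of a nerve agent precursor.",
    "Gardening tips for beginners.",
]

# Keep only documents that match no blocked pattern.
kept = [doc for doc in documents if not BLOCKLIST.search(doc)]
print(f"kept {len(kept)} of {len(documents)} documents")
```

Classifier-based filters follow the same keep/drop structure but replace the regex test with a model's violation score compared against a threshold.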
2.2 Exploratory Methods
Beyond data filtering, researchers are investigating additional capability limitation approaches. However, these methods face technical challenges, and their effectiveness remains uncertain.
● Model distillation could create specialized versions of frontier models with capabilities limited to specific domains. For example, a model could excel at medical diagnosis while lacking knowledge needed for biological weapons development. While the capability limitations may be more fundamental than post-hoc safety training, it remains unclear how effectively this approach prevents harmful capabilities from being reconstructed. Additionally, multiple specialized models would be needed to cover various use cases, increasing development and maintenance costs.
● Targeted unlearning attempts to remove specific dangerous capabilities from models after initial training, offering a more precise alternative to full retraining. Possible approaches include fine-tuning on datasets to overwrite specific knowledge while preserving general capabilities, or modifying how models internally structure and access particular information. However, these methods may be reversible with relatively modest effort – restoring "unlearned" capabilities through targeted fine-tuning with small datasets. Models may also regenerate removed knowledge by inferring from adjacent information that remains accessible.
While research continues on these approaches, developers currently rely more heavily on post-deployment mitigations that can be more reliably implemented and assessed.
Frontier Mitigations
Frontier Model Forum (2025)
Frontier mitigations are protective measures implemented on frontier models, with the goal of reducing the risk of potential high-severity harms, especially those related to national security and public safety, that could arise from their advanced capabilities. This report discusses emerging industry practices for implementing and assessing frontier mitigations. It focuses on mitigations for managing risks in three primary domains: chemical, biological, radiological and nuclear (CBRN) information threats; advanced cyber threats; and advanced autonomous behavior threats. Given the nascent state of frontier mitigations, this report describes the range of controls and mitigation strategies being employed or researched by Frontier Model Forum members and documents the known limitations of these approaches.
Operate and Monitor: Running, maintaining, and monitoring the AI system post-deployment.
Deployer: Entity that integrates and deploys the AI system for end users.
Manage: Prioritising, responding to, and mitigating AI risks.