Runtime behavior observation, anomaly detection, and activity logging.
Linear probes, also referred to as “probes”, are simple classifiers trained to predict specific properties from a model's activations. These activations are the intermediate computational states the model produces as it processes input, representing the internal computation taking place while it generates an output. Probes can be trained to check these activations for patterns associated with undesirable behavior, such as generating harmful content or being deceptive.
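As a rough, hypothetical illustration of how such a classifier might be fit (the synthetic activations, labels, and dimensions below are stand-ins, not any developer's actual setup), a linear probe can be as simple as a logistic regression trained on activation vectors cached from a chosen layer:

```python
# Minimal sketch: fit a linear probe on cached activations.
# In practice the activations would be hidden states saved from a chosen
# layer of the model; here synthetic data stands in for them.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_dim, n_examples = 256, 1000

# Stand-in activations and binary labels (e.g., 1 = harmful, 0 = benign).
activations = rng.normal(size=(n_examples, hidden_dim))
labels = rng.integers(0, 2, size=n_examples)

probe = LogisticRegression(max_iter=1000)
probe.fit(activations, labels)

# At inference time, score a new activation vector; a score near 1.0
# indicates the probe thinks the flagged property is present.
new_activation = rng.normal(size=(1, hidden_dim))
score = probe.predict_proba(new_activation)[0, 1]
print(f"probe score: {score:.3f}")
```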
Probes offer several advantages over other monitoring methods: they require less compute than running a prompted language model and introduce minimal latency to the model's response time; their simplicity and low computational cost also make them relatively easy to retrain with new data or adjust during development, such as changing detection thresholds to be more or less strict; and they can potentially identify problematic internal states even when the input or output appears benign, which may be particularly useful for misalignment detection.

Probes also have limitations. They generally need to be retrained for each new model version, although this retraining can often be relatively quick, and they may generalize poorly to different or more complex examples encountered in real-world use. Lastly, implementation choices, such as which layers of the model to probe and how to aggregate probe scores across the tokens output by the model (e.g., averaging across all tokens, focusing on specific critical areas of the response, or taking the highest score), involve tradeoffs that affect overall accuracy and sensitivity.
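The aggregation tradeoffs mentioned above can be made concrete with a small sketch; the aggregation methods and the threshold value here are illustrative choices, not values recommended by any source:

```python
# Sketch: turn per-token probe scores into one decision per response.
# Different aggregations trade sensitivity against false positives:
# max() flags a single suspicious token, mean() requires a broader signal.
import numpy as np

def aggregate_scores(token_scores: np.ndarray, method: str = "max") -> float:
    if method == "max":
        return float(token_scores.max())        # most sensitive
    if method == "mean":
        return float(token_scores.mean())       # most forgiving
    if method == "topk_mean":
        k = max(1, len(token_scores) // 10)     # focus on the worst 10%
        return float(np.sort(token_scores)[-k:].mean())
    raise ValueError(f"unknown method: {method}")

token_scores = np.array([0.02, 0.05, 0.91, 0.10])  # toy per-token scores
threshold = 0.8                                     # tunable strictness
flagged = aggregate_scores(token_scores, "max") >= threshold
print("flag response for review:", flagged)
```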
Reasoning
Linear probes extract interpretable features from model representations, changing how learned parameters are analyzed post-training rather than modifying the parameters themselves.
Capability Limitation Mitigations
Capability limitation mitigations aim to prevent models from possessing knowledge or abilities that could enable harm. These methods alter the model's weights or training process so that it cannot assist with harmful actions when prompted by humans or autonomously pursue harmful objectives. However, the effectiveness of these mitigations is an active area of research, and they can currently be circumvented if dual-use knowledge (knowledge that has both benign and harmful applications) is reintroduced, whether through the context window during inference or through fine-tuning.
2.1 Data Filtering
Data filtering involves removing content from training datasets that could lead to dual-use or potentially harmful capabilities. Developers can use several methods: automated classifiers to identify and remove content related to weapons development, detailed attack methodologies, or other high-risk domains; keyword-based filters to exclude documents containing specific terminology or instructions of concern; and machine learning models trained to recognize subtle patterns in content that might contribute to dangerous capabilities.
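A toy sketch of such a filtering pipeline is below; the blocked terms, the optional classifier hook, and the risk threshold are invented placeholders, and a production pipeline would typically combine several screening stages with human review:

```python
# Sketch: screen training documents with a keyword blocklist plus an
# optional learned classifier. Terms and thresholds are placeholders.
from typing import Callable, Iterable, List, Optional

BLOCKED_TERMS = ["placeholder risky term", "another placeholder term"]

def keep_document(text: str,
                  classifier_score: Optional[Callable[[str], float]] = None,
                  max_risk: float = 0.5) -> bool:
    lowered = text.lower()
    # Stage 1: cheap keyword screen for terminology of concern.
    if any(term in lowered for term in BLOCKED_TERMS):
        return False
    # Stage 2: optional classifier for subtler high-risk patterns.
    if classifier_score is not None and classifier_score(text) > max_risk:
        return False
    return True

def filter_corpus(docs: Iterable[str]) -> List[str]:
    """Return only the documents that pass both screening stages."""
    return [doc for doc in docs if keep_document(doc)]

# Toy usage: the second document matches a blocked term and is dropped.
corpus = ["a benign document", "contains a placeholder risky term"]
print(filter_corpus(corpus))
```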
2.2 Exploratory Methods
Beyond data filtering, researchers are investigating additional capability limitation approaches. However, these methods face technical challenges, and their effectiveness remains uncertain.
● Model distillation could create specialized versions of frontier models with capabilities limited to specific domains. For example, a model could excel at medical diagnosis while lacking knowledge needed for biological weapons development. While the capability limitations may be more fundamental than post-hoc safety training, it remains unclear how effectively this approach prevents harmful capabilities from being reconstructed. Additionally, multiple specialized models would be needed to cover various use cases, increasing development and maintenance costs.
● Targeted unlearning attempts to remove specific dangerous capabilities from models after initial training, offering a more precise alternative to full retraining. Possible approaches include fine-tuning on datasets to overwrite specific knowledge while preserving general capabilities, or modifying how models internally structure and access particular information (a minimal sketch of the fine-tuning idea follows below). However, these methods may be reversible with relatively modest effort – restoring "unlearned" capabilities through targeted fine-tuning with small datasets. Models may also regenerate removed knowledge by inferring from adjacent information that remains accessible.
While research continues on these approaches, developers currently rely more heavily on post-deployment mitigations that can be more reliably implemented and assessed.
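One way to picture the fine-tuning-based unlearning idea referenced in the list above is a combined objective that penalizes performance on a "forget" set while preserving it on a "retain" set. The sketch below uses a stand-in model and random data purely to show the shape of that objective; it is not a description of any particular published or deployed method:

```python
# Sketch of one unlearning recipe: maximize loss on a "forget" set while
# minimizing it on a "retain" set. The model and data here are stand-ins.
import torch
import torch.nn as nn

model = nn.Linear(16, 4)                      # stand-in for a real LM
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
forget_lambda = 0.5                           # strength of the forgetting term

forget_x, forget_y = torch.randn(32, 16), torch.randint(0, 4, (32,))
retain_x, retain_y = torch.randn(32, 16), torch.randint(0, 4, (32,))

for step in range(100):
    retain_loss = loss_fn(model(retain_x), retain_y)
    forget_loss = loss_fn(model(forget_x), forget_y)
    # Descend on retain data, ascend on forget data.
    loss = retain_loss - forget_lambda * forget_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```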
Frontier Mitigations
Frontier Model Forum (2025)
Frontier mitigations are protective measures implemented on frontier models, with the goal of reducing the risk of potential high-severity harms, especially those related to national security and public safety, that could arise from their advanced capabilities. This report discusses emerging industry practices for implementing and assessing frontier mitigations. It focuses on mitigations for managing risks in three primary domains: chemical, biological, radiological and nuclear (CBRN) information threats; advanced cyber threats; and advanced autonomous behavior threats. Given the nascent state of frontier mitigations, this report describes the range of controls and mitigation strategies being employed or researched by Frontier Model Forum members and documents the known limitations of these approaches.
Operate and Monitor: Running, maintaining, and monitoring the AI system post-deployment
Developer: Entity that creates, trains, or modifies the AI system
Measure: Quantifying, testing, and monitoring identified AI risks