Technical mechanisms operating on non-model components of the AI system without modifying model weights. Components include: input/output interfaces, runtime environment, guardrail/monitoring classifiers, tool chain, and hardware.
Similar to cybersecurity protocols, AI providers can prepare for potential adversarial action from AI systems by developing and stress-testing response protocols in advance [179]. Such adversarial action may range from hidden code exploits to strategic deception or loss-of-control behaviors. AI systems may be adversarial because of malicious actors (e.g., data poisoning) or because of unintended control failures (lack of alignment with human intent). Alongside improving system alignment, external monitoring and restrictions may be deployed in advance to “catch” these adversarial model malfunctions before harm is done. As with cybersecurity red teaming, any protocol that assumes adversarial action can be stress-tested in advance, for example by projecting the plausible capabilities and affordances of the adversarial actor (in the case of a tech company, a hacker or a disgruntled employee). From this process, measures can be deployed that better anticipate possible adversarial action and catch unsuccessful attempts at adversarial behavior. Caught attempts can be documented in incident reports and serve as input to further risk analysis; depending on their severity, it may be advisable to recall the system if it is already deployed.
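To make the stress-testing step concrete, the sketch below simulates a red-team style exercise: projected adversarial attempts (scoped by the capabilities and affordances attributed to the actor) are replayed against a candidate monitoring protocol, and every caught attempt is turned into an incident report. This is a minimal illustration only; the data classes and function names are hypothetical and not taken from any cited framework.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class AdversarialAttempt:
    """One projected adversarial action, e.g. a hidden code exploit."""
    description: str
    artifact: str       # e.g. generated code or an action transcript
    actor_profile: str  # e.g. "external hacker" or "disgruntled employee"

@dataclass
class IncidentReport:
    """Record of a caught attempt, kept as input to further risk analysis."""
    attempt: AdversarialAttempt
    detected_by: str

def stress_test_protocol(
    flags_artifact: Callable[[str], bool],        # the monitoring protocol under test
    projected_attempts: List[AdversarialAttempt],
) -> Tuple[float, List[IncidentReport]]:
    """Replay projected adversarial attempts against a monitoring protocol and
    report the fraction that would have been caught before causing harm."""
    incidents = [
        IncidentReport(attempt, detected_by="runtime monitor")
        for attempt in projected_attempts
        if flags_artifact(attempt.artifact)
    ]
    catch_rate = len(incidents) / len(projected_attempts) if projected_attempts else 1.0
    return catch_rate, incidents
```

A low catch rate for a given actor profile would then feed back into risk analysis, for example by tightening the monitoring threshold or restricting the affordances available to the deployed system.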
Examples of these protocols include [84] (a sketch of the first two appears after this list):
• Flagging suspicious frontier AI system code generation for human oversight, with the help of better-proven (“trusted”) AI systems or automated keyword detection.
• Using a better-proven (“trusted”) AI system to edit code produced by a frontier AI system when it is flagged as suspicious.
• Monitoring the AI system with another instance of the same system, with filters to prevent collusion.
• Restricting the AI system’s access to private information.
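The following sketch illustrates the first two protocols in the list, the flagging and trusted-editing steps, under simplifying assumptions: a keyword check stands in for the suspicion score a trusted model or classifier would produce, and untrusted_generate and trusted_edit are hypothetical placeholders for calls to the frontier and better-proven models.

```python
from typing import Callable

# Stand-in for automated keyword detection; a deployed protocol would more
# likely query a better-proven ("trusted") model or a dedicated classifier.
SUSPICIOUS_KEYWORDS = ("eval(", "exec(", "subprocess", "os.system", "pickle.loads")

def keyword_suspicion(code: str) -> float:
    """Crude suspicion score: fraction of watched keywords present in the code."""
    hits = sum(1 for keyword in SUSPICIOUS_KEYWORDS if keyword in code)
    return hits / len(SUSPICIOUS_KEYWORDS)

def monitored_generation(
    untrusted_generate: Callable[[str], str],  # frontier ("untrusted") model
    trusted_edit: Callable[[str], str],        # better-proven ("trusted") model
    task: str,
    edit_threshold: float = 0.2,
) -> str:
    """Generate code with the frontier model, score it, and either accept it,
    have the trusted model rewrite it, or block it and escalate to a human."""
    code = untrusted_generate(task)
    score = keyword_suspicion(code)
    if score == 0.0:
        return code                    # nothing flagged: pass through unchanged
    if score < edit_threshold:
        return trusted_edit(code)      # mildly suspicious: trusted model edits it
    raise RuntimeError(                # highly suspicious: block, flag for human
        f"Generation flagged (suspicion={score:.2f}); escalate to human oversight "
        "and file an incident report."
    )
```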
Reasoning
Runtime monitoring protocols track potentially adversarial GPAI system behavior for abuse detection.
Monitoring
Model development | 2.4 Engineering & Development
Model development > Data-related | 1.1 Model
Model evaluations | 2.2.2 Testing & Evaluation
Model evaluations > General evaluations | 2.2.2 Testing & Evaluation
Model evaluations > Benchmarking | 3.2.1 Benchmarks & Evaluation
Model evaluations > Red teaming | 2.2.2 Testing & Evaluation

Risk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems
Gipiškis, Rokas; San Joaquin, Ayrton; Chin, Ze Shen; Regenfuß, Adrian; Gil, Ariel; Holtman, Koen (2024)
Organizations and governments that develop, deploy, use, and govern AI must coordinate on effective risk mitigation. However, the landscape of AI risk mitigation frameworks is fragmented, uses inconsistent terminology, and has gaps in coverage. This paper introduces a preliminary AI Risk Mitigation Taxonomy to organize AI risk mitigations and provide a common frame of reference. The Taxonomy was developed through a rapid evidence scan of 13 AI risk mitigation frameworks published between 2023 and 2025, from which 831 distinct AI risk mitigations were extracted into a living database.
Operate and Monitor
Running, maintaining, and monitoring the AI system post-deployment
Deployer
Entity that integrates and deploys the AI system for end users
Measure
Quantifying, testing, and monitoring identified AI risks