Technical mechanisms operating on non-model components of the AI system without modifying model weights. Components include: input/output interfaces, runtime environment, guardrail/monitoring classifiers, tool chain, and hardware.
Similar to cybersecurity protocols, AI providers can prepare for potential adversarial action from AI systems by developing and stress-testing response protocols in advance [179]. Such adversarial action may range from hidden code exploits to strategic deception or loss-of-control behaviors. AI systems may be adversarial because of malicious actors (e.g., data poisoning) or because of unintended control failures (lack of alignment with human intent). Alongside improving system alignment, external monitoring and restrictions may be deployed in advance to “catch” these adversarial model malfunctions before harm is done. As with cybersecurity red teaming, any protocol that assumes adversarial action can be stress-tested in advance, for example by projecting the plausible capabilities and affordances of the adversarial actor (in the case of a tech company, a hacker or a disgruntled employee). From this process, measures can be deployed that better anticipate possible adversarial action and catch unsuccessful attempts at adversarial behavior. Caught attempts can be documented in incident reports and serve as input to further risk analysis; depending on their severity, it may be advisable to recall the system if it is already deployed.
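To make the stress-testing step concrete, the sketch below simulates a red-team style exercise: projected adversarial attempts (scoped by the capabilities and affordances attributed to the actor) are replayed against a candidate monitoring protocol, and every caught attempt is turned into an incident report. This is a minimal illustration only; the data classes and function names are hypothetical and not taken from any cited framework.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class AdversarialAttempt:
    """One projected adversarial action, e.g. a hidden code exploit."""
    description: str
    artifact: str       # e.g. generated code or an action transcript
    actor_profile: str  # e.g. "external hacker" or "disgruntled employee"

@dataclass
class IncidentReport:
    """Record of a caught attempt, kept as input to further risk analysis."""
    attempt: AdversarialAttempt
    detected_by: str

def stress_test_protocol(
    flags_artifact: Callable[[str], bool],        # the monitoring protocol under test
    projected_attempts: List[AdversarialAttempt],
) -> Tuple[float, List[IncidentReport]]:
    """Replay projected adversarial attempts against a monitoring protocol and
    report the fraction that would have been caught before causing harm."""
    incidents = [
        IncidentReport(attempt, detected_by="runtime monitor")
        for attempt in projected_attempts
        if flags_artifact(attempt.artifact)
    ]
    catch_rate = len(incidents) / len(projected_attempts) if projected_attempts else 1.0
    return catch_rate, incidents
```

A low catch rate for a given actor profile would then feed back into risk analysis, for example by tightening the monitoring threshold or restricting the affordances available to the deployed system.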
Examples of these protocols include [84] (a sketch of the first two appears after this list):
• Flagging suspicious frontier AI system code generation for human oversight, with the help of better-proven (“trusted”) AI systems or automated keyword detection.
• Using a better-proven (“trusted”) AI system to edit code produced by a frontier AI system when it is flagged as suspicious.
• Monitoring the AI system with another instance of the same system, with filters to prevent collusion.
• Restricting the AI system’s access to private information.
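The following sketch illustrates the first two protocols in the list, the flagging and trusted-editing steps, under simplifying assumptions: a keyword check stands in for the suspicion score a trusted model or classifier would produce, and untrusted_generate and trusted_edit are hypothetical placeholders for calls to the frontier and better-proven models.

```python
from typing import Callable

# Stand-in for automated keyword detection; a deployed protocol would more
# likely query a better-proven ("trusted") model or a dedicated classifier.
SUSPICIOUS_KEYWORDS = ("eval(", "exec(", "subprocess", "os.system", "pickle.loads")

def keyword_suspicion(code: str) -> float:
    """Crude suspicion score: fraction of watched keywords present in the code."""
    hits = sum(1 for keyword in SUSPICIOUS_KEYWORDS if keyword in code)
    return hits / len(SUSPICIOUS_KEYWORDS)

def monitored_generation(
    untrusted_generate: Callable[[str], str],  # frontier ("untrusted") model
    trusted_edit: Callable[[str], str],        # better-proven ("trusted") model
    task: str,
    edit_threshold: float = 0.2,
) -> str:
    """Generate code with the frontier model, score it, and either accept it,
    have the trusted model rewrite it, or block it and escalate to a human."""
    code = untrusted_generate(task)
    score = keyword_suspicion(code)
    if score == 0.0:
        return code                    # nothing flagged: pass through unchanged
    if score < edit_threshold:
        return trusted_edit(code)      # mildly suspicious: trusted model edits it
    raise RuntimeError(                # highly suspicious: block, flag for human
        f"Generation flagged (suspicion={score:.2f}); escalate to human oversight "
        "and file an incident report."
    )
```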
Reasoning
Runtime monitoring protocols track potentially adversarial GPAI system behavior for abuse detection.
Monitoring
Model development | 2.4 Engineering & Development
Model development > Data-related | 1.1 Model
Model evaluations | 2.2.2 Testing & Evaluation
Model evaluations > General evaluations | 2.2.2 Testing & Evaluation
Model evaluations > Benchmarking | 3.2.1 Benchmarks & Evaluation
Model evaluations > Red teaming | 2.2.2 Testing & Evaluation

Risk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems
Gipiškis, Rokas; San Joaquin, Ayrton; Chin, Ze Shen; Regenfuß, Adrian; Gil, Ariel; Holtman, Koen (2024)
Organizations and governments that develop, deploy, use, and govern AI must coordinate on effective risk mitigation. However, the landscape of AI risk mitigation frameworks is fragmented, uses inconsistent terminology, and has gaps in coverage. This paper introduces a preliminary AI Risk Mitigation Taxonomy to organize AI risk mitigations and provide a common frame of reference. The Taxonomy was developed through a rapid evidence scan of 13 AI risk mitigation frameworks published between 2023 and 2025, from which 831 distinct AI risk mitigations were extracted into a living database.
Operate and Monitor
Running, maintaining, and monitoring the AI system post-deployment
Deployer
Entity that integrates and deploys the AI system for end users
Measure
Quantifying, testing, and monitoring identified AI risks