Adaptation Interventions > Defence interventions
Holding fixed that the potentially harmful use of AI occurs, society can reduce the expected extent of the corresponding initial harm. In our spear phishing example, “defence” is a matter of reducing the chance that the spear phishing emails succeed in giving the cybercriminal access to the sensitive information. For example, companies could provide anti-phishing training to their staff, and implement tools that warn staff of suspected phishing emails. They could ensure that only a very small number of staff members have access to particularly sensitive information, and then only with approval from other employees.
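The last of these controls, peer-approved access to sensitive information, is sometimes implemented as a "two-person rule". Below is a minimal Python sketch of such a check; all names (AccessRequest, SENSITIVE_RESOURCES, the staff list) are hypothetical illustrations, not anything specified in the paper.

```python
# Minimal sketch of a "two-person rule": a staff member may read a
# sensitive resource only once a distinct, authorised colleague has
# approved the request. All names here are illustrative.
from dataclasses import dataclass, field

@dataclass
class AccessRequest:
    requester: str
    resource: str
    approvals: set[str] = field(default_factory=set)

SENSITIVE_RESOURCES = {"staff_directory", "proprietary_designs"}
AUTHORISED_STAFF = {"alice", "bob", "carol"}  # the "very small number" of staff

def approve(request: AccessRequest, approver: str) -> None:
    # An approval only counts if it comes from a different authorised person.
    if approver in AUTHORISED_STAFF and approver != request.requester:
        request.approvals.add(approver)

def access_granted(request: AccessRequest) -> bool:
    if request.resource not in SENSITIVE_RESOURCES:
        return True  # non-sensitive material needs no approval
    return request.requester in AUTHORISED_STAFF and len(request.approvals) >= 1

req = AccessRequest(requester="alice", resource="proprietary_designs")
approve(req, "bob")
assert access_granted(req)
```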
[4.1] Election Manipulation with Generative AI
[4.1.2] Defence: Public awareness campaigns can empower individuals to critically assess AI-generated content. Content provenance techniques applied throughout the lifetime of a piece of media, e.g. when a photo is captured and edited, can help to verify genuine content (Srinivasan 2024; Earnshaw and MacCormack 2023). AI content detection tools can enable platforms to take appropriate actions such as removal, labelling, or adding scalable counter-disinformation such as Community Notes (Wojcik et al. 2022).
[4.2] AI-Enabled Cyberterrorism Attacks on Critical Infrastructure
[4.2.2] Defence: Defensive AI capabilities can augment traditional cyber defence, for example by detecting and patching security vulnerabilities (Lohn and Jackson 2022). Better information-sharing networks enhance the ability to detect diffuse or stealthy cyberterrorism and rapidly mitigate its impacts (Johnson et al. 2016).
[4.3] Loss of Control to AI Decision-Makers
[4.3.2] Defence: Human-in-the-loop requirements can mandate human oversight for certain high-stakes decisions, as was proposed in the U.S. Block Nuclear Launch by Autonomous Artificial Intelligence Act of 2023 (U.S. Senate 2023), or ensure that AI decision-makers augment rather than strictly replace humans (Acemoglu and Restrepo 2019). Society could rigorously audit high-stakes provisional AI decisions before acting on them, and red team these auditing mechanisms (see the sketch below). Whistleblower protections could encourage people to report issues in AI decision-making (Katyal 2018; Bloch-Wehba 2024). Lastly, society could invest considerable resources in ensuring that AI systems do in fact act in accordance with our wishes, even where humans are incapable of providing effective supervision (Bowman et al. 2022).
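To make the human-in-the-loop and auditing defence in [4.3.2] concrete, here is a minimal, hypothetical Python sketch of a gate that holds high-stakes AI decisions as provisional until a human auditor signs off. The stakes score, threshold, and data model are illustrative assumptions, not mechanisms specified in the paper.

```python
# Hypothetical sketch: AI decisions above a stakes threshold are held as
# provisional and only executed after human approval; low-stakes decisions
# pass through automatically.
from dataclasses import dataclass

STAKES_THRESHOLD = 0.8  # decisions scored above this require human review

@dataclass
class Decision:
    action: str
    stakes: float           # estimated severity of acting wrongly, in [0, 1]
    human_approved: bool = False

def execute(decision: Decision) -> str:
    if decision.stakes >= STAKES_THRESHOLD and not decision.human_approved:
        return f"HELD for human audit: {decision.action}"
    return f"EXECUTED: {decision.action}"

print(execute(Decision(action="reroute delivery drone", stakes=0.2)))
print(execute(Decision(action="adjust grid load-shedding plan", stakes=0.95)))
```

Red teaming this auditing mechanism would then mean deliberately probing whether high-stakes decisions can slip past the gate, for instance via systematically underestimated stakes scores.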
Reasoning
Red-teaming auditing mechanisms and testing defensive controls to assess how effectively they reduce harm.
Capability-Modifying Interventions
Capability-modifying interventions intervene at points immediately preceding the “development” and “diffusion” steps of the causal chain.
Capability-Modifying Interventions > Development interventions
Society can affect which AI capabilities are developed. For example, companies could refrain from developing systems that have certain potentially harmful capabilities, or make systems that are more resistant to jailbreaking, have higher chances of refusing potentially harmful requests, or have outputs that are more easily identifiable as AI-generated.
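As a toy illustration of the refusal behaviour mentioned above, here is a hedged Python sketch of a pre-generation request filter. A production system would use a trained safety classifier rather than keyword matching; HARMFUL_PATTERNS and moderated_generate are hypothetical names, not anything from the paper.

```python
# Hypothetical sketch: requests matching a harmful-use policy are refused
# before the model is invoked. Keyword matching stands in for what would,
# in practice, be a trained classifier.
HARMFUL_PATTERNS = ("spear phishing", "malware payload", "synthesise a toxin")

def moderated_generate(prompt: str, generate) -> str:
    lowered = prompt.lower()
    if any(pattern in lowered for pattern in HARMFUL_PATTERNS):
        return "Refused: this request appears to violate the usage policy."
    return generate(prompt)  # `generate` stands in for a real model call

print(moderated_generate("Draft a spear phishing email to an executive",
                         generate=lambda p: "<model output>"))
```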
Capability-Modifying Interventions > Diffusion interventions
Society can affect which AI systems are made available, to whom, and with what degrees of access. For example, companies can employ “staged release”: gradually making the system more widely available (Solaiman 2023). They could make potentially risky models available only via an API, allowing them to implement secure safeguards, such as watermarking or content provenance tags (Shevlane 2022). They could enforce terms of service policies, removing access from customers who use the system in prohibited ways.
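The API-mediated access described above can be pictured as a thin policy layer in front of the model. The sketch below, with entirely hypothetical names and tiers, combines three of the measures mentioned: staged release, provenance tagging, and terms-of-service enforcement.

```python
# Hypothetical sketch of API-gated diffusion: customers sit in staged-release
# tiers, outputs carry provenance metadata, and prohibited use revokes access.
from dataclasses import dataclass

@dataclass
class Customer:
    name: str
    tier: int             # staged release: higher tiers unlock newer models
    revoked: bool = False

def serve(customer: Customer, prompt: str, required_tier: int = 2) -> dict:
    if customer.revoked:
        raise PermissionError("access revoked for terms-of-service violation")
    if customer.tier < required_tier:
        raise PermissionError("model not yet released to this customer's tier")
    output = f"<model response to: {prompt}>"  # stand-in for a real model call
    return {"output": output,
            "provenance": {"generator": "example-model", "customer": customer.name}}

def enforce_terms_of_service(customer: Customer, violation_detected: bool) -> None:
    if violation_detected:
        customer.revoked = True  # remove access from customers who misuse the system

customer = Customer(name="acme", tier=3)
print(serve(customer, "summarise this report"))
```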
Adaptation Interventions
Adaptation interventions, the primary focus of this paper, intervene at later stages in the causal chain. Such interventions immediately precede the “use”, “initial harm” or “impact” stages of that chain. (Occasionally, a specific intervention can affect multiple points along the causal chain.)
Adaptation Interventions > Avoidance interventions
Society can reduce the expected extent of the potentially harmful use of AI by making the problematic actions in question more difficult to engage in, or more costly compared to relevant alternatives. One can make it more difficult for a given instance of potentially harmful AI activity to occur by limiting the user’s or the AI system’s access to key resources required for the activity in question, or to key actuators required to complete the intended action. (In the spear phishing example we introduced in Section 3.1, relevant companies could make it harder for cybercriminals to access the names and contact details of their staff.) One can make potentially harmful uses of AI more costly ex ante by building institutions that create credible threats of punishment for harmful use.
Adaptation Interventions > Remedial interventions
Holding fixed that the initial harm occurs, society can reduce or eliminate the expected negative impact downstream of it. In our spear phishing example, this might involve reducing the extent to which national security is undermined as a result of the sale of the proprietary information to a foreign actor. For example, the company could include some false and misleading documents on its servers. Governments could reduce incentives for staff with relevant implicit knowledge to work for the foreign actor, on the grounds that implicit knowledge is often required to complement the information contained in documents.
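The decoy-document measure above resembles the "canary token" technique from security practice: seeding plausible but false documents with unique markers, so that material which later surfaces can be identified as decoy content. A minimal hypothetical Python sketch, with all names illustrative rather than drawn from the paper:

```python
# Hypothetical sketch of "canary" decoy documents: each decoy carries a
# unique marker, so leaked material can be recognised as coming from a
# decoy rather than from genuine records.
import secrets

def make_decoy(title: str) -> dict:
    marker = secrets.token_hex(8)  # unique identifier embedded in the decoy
    body = f"[CONFIDENTIAL] {title} (ref:{marker}) <plausible but false content>"
    return {"title": title, "marker": marker, "body": body}

decoys = [make_decoy("Q3 prototype specifications")]
KNOWN_MARKERS = {d["marker"] for d in decoys}

def appeared_in_leak(leaked_text: str) -> bool:
    # True if any decoy marker shows up in material that has surfaced.
    return any(marker in leaked_text for marker in KNOWN_MARKERS)

print(appeared_in_leak(decoys[0]["body"]))  # True: this text is decoy content
```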
Societal Adaptation to Advanced AI
Bernardi, Jamie; Mukobi, Gabriel; Greaves, Hilary; Heim, Lennart; Anderljung, Markus (2025)
Existing strategies for managing risks from advanced AI systems often focus on affecting what AI systems are developed and how they diffuse. However, this approach becomes less feasible as the number of developers of advanced AI grows, and impedes beneficial use-cases as well as harmful ones. In response, we urge a complementary approach: increasing societal adaptation to advanced AI, that is, reducing the expected negative impacts from a given level of diffusion of a given AI capability. We introduce a conceptual framework which helps identify adaptive interventions that avoid, defend against and remedy potentially harmful uses of AI systems, illustrated with examples in election manipulation, cyberterrorism, and loss of control to AI decision-makers.
Lifecycle stage: Other (outside the standard AI system lifecycle)
Actor type: Other (not captured by the standard categories)
Risk management function: Other (not captured by the standard AIRM categories)
Primary risk domain: 4 Malicious Actors & Misuse