Deceptive alignment
Risk Domain
AI systems acting in conflict with human goals or values, especially the goals of designers or users, or with ethical standards. These misaligned behaviors may be introduced by humans during design and development, such as through reward hacking and goal misgeneralisation, or may result from AI using dangerous capabilities such as manipulation, deception, or situational awareness to seek power, self-proliferate, or achieve other goals.
"AI models and systems that appear aligned with human goals during development may behave unpredictably or dangerously once deployed" (p. 12)
Entity: Who or what caused the harm
Intent: Whether the harm was intentional or accidental
Timing: Whether the risk is pre- or post-deployment
Other risks from Uuk2025 (60)
Risk           | Subdomain                                                            | Entity    | Intent        | Timing
Control        | 7.1 AI pursuing its own goals in conflict with human goals or values | AI system | Intentional   | Post-deployment
Democracy      | 6.0 Socioeconomic & Environmental                                    | Other     | Other         | Other
Discrimination | 1.1 Unfair discrimination and misrepresentation                      | Other     | Other         | Post-deployment
Economy        | 6.2 Increased inequality and decline in employment quality           | Other     | Other         | Post-deployment
Environment    | 6.6 Environmental harm                                               | AI system | Unintentional | Post-deployment
Governance     | 6.5 Governance failure                                               | AI system | Unintentional | Post-deployment