Alignment failures in existing ML systems
AI systems acting in conflict with human goals or values, especially the goals of designers or users, or ethical standards. These misaligned behaviors may be introduced by humans during design and development, such as through reward hacking and goal misgeneralisation, or may result from AI using dangerous capabilities such as manipulation, deception, situational awareness to seek power, self-proliferate, or achieve other goals.
-
Sub-categories (8)
Faulty reward functions in the wild
-
7.1 AI pursuing its own goals in conflict with human goals or valuesSpecification gaming
-
7.1 AI pursuing its own goals in conflict with human goals or valuesReward model overoptimization
-
7.1 AI pursuing its own goals in conflict with human goals or valuesInstrumental convergence
-
7.1 AI pursuing its own goals in conflict with human goals or valuesGoal misgeneralization
-
7.1 AI pursuing its own goals in conflict with human goals or valuesInner misalignment
-
7.1 AI pursuing its own goals in conflict with human goals or valuesLanguage model misalignment
-
7.1 AI pursuing its own goals in conflict with human goals or valuesHarms from increasingly agentic algorithmic systems
-
7.2 AI possessing dangerous capabilitiesOther risks from Maas (2023) (25)
Dangerous capabilities in AI systems
7.2 AI possessing dangerous capabilitiesDangerous capabilities in AI systems > Situational awareness
7.2 AI possessing dangerous capabilitiesDangerous capabilities in AI systems > Acquisition of a goal to harm society
4.2 Cyberattacks, weapon development or use, and mass harmDangerous capabilities in AI systems > Acquisition of goals to seek power and control
7.1 AI pursuing its own goals in conflict with human goals or valuesDangerous capabilities in AI systems > Self-improvement
7.2 AI possessing dangerous capabilitiesDangerous capabilities in AI systems > Autonomous replication
7.2 AI possessing dangerous capabilities