Skip to main content
Home/Risks/Gabriel et al. (2024)/Specification gaming

Specification gaming

The Ethics of Advanced AI Assistants

Gabriel et al. (2024)

Sub-category
Risk Domain

AI systems acting in conflict with human goals or values, especially the goals of designers or users, or ethical standards. These misaligned behaviors may be introduced by humans during design and development, such as through reward hacking and goal misgeneralisation, or may result from AI using dangerous capabilities such as manipulation, deception, situational awareness to seek power, self-proliferate, or achieve other goals.

"Specification gaming (Krakovna et al., 2020) occurs when some faulty feedback is provided to the assistant in the training data (i.e. the training objective O does not fully capture what the user/designer wants the assistant to do). It is typified by the sort of behaviour that exploits loopholes in the task specification to satisfy the literal specification of a goal without achieving the intended outcome."(p. 60)

Part of Goal-related failures

Other risks from Gabriel et al. (2024) (69)