
Specification gaming

Sub-category
Risk Domain

AI systems acting in conflict with human goals or values, especially the goals of designers or users, or with ethical standards. These misaligned behaviors may be introduced by humans during design and development, such as through reward hacking and goal misgeneralisation, or may result from AI using dangerous capabilities such as manipulation, deception, or situational awareness to seek power, self-proliferate, or achieve other goals.

"AI systems can achieve user-specified tasks in undesirable ways unless they are specified carefully and in enough detail. AI systems might find an easier unintended way to accomplish the objective provided by the user or developer, so that the actions by the AI system taken during its execution are very different from what the user expected [75, 191]. This behavior arises not from a problem with the learning algorithm, but rather from the misspecification or underspecification of the intended task, and is generally referred to as specification gaming [43]." (p. 29)
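
The mechanism described in the quote can be illustrated with a toy example. The sketch below is not taken from the source: the cleaning scenario, the names `Room`, `step`, `run`, `intended_policy`, and `gaming_policy`, and the reward rule are all hypothetical, chosen only to show how an underspecified objective can be satisfied in an unintended way. The reward counts pieces of mess picked up, while the designer actually wants a clean room.

```python
from dataclasses import dataclass


@dataclass
class Room:
    mess: int  # pieces of mess currently on the floor


def step(action: str, room: Room) -> int:
    """Apply an action and return the (misspecified) reward.

    Hypothetical reward rule: +1 for each piece picked up.
    Nothing penalises putting mess back on the floor.
    """
    if action == "pick_up" and room.mess > 0:
        room.mess -= 1
        return 1
    if action == "dump":
        room.mess += 1  # undoes earlier cleaning at no cost
    return 0


def run(policy, steps: int, initial_mess: int):
    """Return (total reward, mess left on the floor) after `steps` actions."""
    room = Room(mess=initial_mess)
    total = 0
    for _ in range(steps):
        total += step(policy(room), room)
    return total, room.mess


def intended_policy(room: Room) -> str:
    # What the designer had in mind: clean up, then stop.
    return "pick_up" if room.mess > 0 else "wait"


def gaming_policy(room: Room) -> str:
    # Exploits the reward rule: once the floor is clean, dump and re-collect.
    return "pick_up" if room.mess > 0 else "dump"


if __name__ == "__main__":
    print("intended:", run(intended_policy, steps=10, initial_mess=3))
    print("gaming:  ", run(gaming_policy, steps=10, initial_mess=3))
```

Under these assumptions the gaming policy earns a higher reward than the intended one while leaving the room dirtier. The learning signal is followed faithfully in both cases; the gap comes from the task specification, which rewards picking up mess without penalising the creation of new mess.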

Part of Agency (Goal-Directedness)

Other risks from Gipiškis2024 (144)