Specification gaming
AI systems acting in conflict with human goals or values, especially the goals of designers or users, or ethical standards. These misaligned behaviors may be introduced by humans during design and development, such as through reward hacking and goal misgeneralisation, or may result from AI using dangerous capabilities such as manipulation, deception, situational awareness to seek power, self-proliferate, or achieve other goals.
"AI systems can achieve user-specified tasks in undesirable ways unless they are specified carefully and in enough detail. AI systems might find an easier unintended way to accomplish the objective provided by the user or developer, so that the actions by the AI system taken during its execution are very different from what the user expected [75, 191]. This behavior arises not from a problem with the learning algorithm, but rather from the misspecification or underspeci- fication of the intended task, and is generally referred to as specification gaming [43]."(p. 29)
Part of Agency (Goal-Directedness)
Other risks from Gipiškis2024 (144)
Direct Harm Domains (content safety harms)
1.2 Exposure to toxic contentDirect Harm Domains (content safety harms) > Violence and extremism
1.2 Exposure to toxic contentDirect Harm Domains (content safety harms) > Hate and toxicity
1.2 Exposure to toxic contentDirect Harm Domains (content safety harms) > Sexual content
1.2 Exposure to toxic contentDirect Harm Domains (content safety harms) > Child harm
1.2 Exposure to toxic contentDirect Harm Domains (content safety harms) > Self-harm
1.2 Exposure to toxic content