Reward model overoptimization
AI systems acting in conflict with human goals or values, especially the goals of designers or users, or with ethical standards. These misaligned behaviors may be introduced by humans during design and development, such as through reward hacking and goal misgeneralization, or may result from AI using dangerous capabilities such as manipulation, deception, or situational awareness to seek power, self-proliferate, or achieve other goals.
(p. 29)
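The entry's title refers to the failure mode in which optimizing a policy against a learned or proxy reward keeps raising the proxy score while performance on the true objective degrades (a Goodhart effect closely related to reward hacking). Below is a minimal sketch in a hypothetical one-dimensional toy setting; the functions true_reward and proxy_reward are illustrative assumptions, not taken from the source.

```python
import numpy as np

def true_reward(x):
    # Hypothetical "real" objective: best around x = 1, worse the further past it we go.
    return -(x - 1.0) ** 2

def proxy_reward(x):
    # Hypothetical learned reward model: roughly aligned near the training
    # distribution, but monotonically increasing, so it rewards overshooting.
    return 0.8 * x

# Greedy optimization against the proxy drifts far past the true optimum.
candidates = np.linspace(0.0, 5.0, 501)
best_by_proxy = candidates[np.argmax(proxy_reward(candidates))]
best_by_true = candidates[np.argmax(true_reward(candidates))]

print(f"optimum under proxy reward: x = {best_by_proxy:.2f}, "
      f"true reward there = {true_reward(best_by_proxy):.2f}")
print(f"optimum under true reward:  x = {best_by_true:.2f}, "
      f"true reward there = {true_reward(best_by_true):.2f}")
```

Running the sketch shows the proxy-optimal action (x = 5.00) scoring far worse on the true objective than the true optimum (x = 1.00), which is the overoptimization pattern the entry describes.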
Part of: Alignment failures in existing ML systems
Other risks from Maas (2023) (25)

Entries under 7.1 AI pursuing its own goals in conflict with human goals or values > Alignment failures in existing ML systems:
- Faulty reward functions in the wild
- Specification gaming
- Instrumental convergence
- Goal misgeneralization
- Inner misalignment