AI systems acting in conflict with human goals or values, especially the goals of their designers or users, or with ethical standards. These misaligned behaviors may be introduced by humans during design and development, for example through reward hacking or goal misgeneralisation, or may result from AI using dangerous capabilities such as manipulation, deception, or situational awareness to seek power, self-proliferate, or achieve other goals.
"Goal Misgeneralization: Goal misgeneralization is another failure mode, wherein the agent actively pursues objectives distinct from the training objectives in deployment while retaining the capabilities it acquired during training (Di Langosco et al., 2022). For instance, in CoinRun games, the agent frequently prefers reaching the end of a level, often neglecting relocated coins during testing scenarios. Di Langosco et al. (2022) draw attention to the fundamental disparity between capability generalization and goal generalization, emphasizing how the inductive biases inherent in the model and its training algorithm may inadvertently prime the model to learn a proxy objective that diverges from the intended initial objective when faced with the testing distribution. It implies that even with perfect reward specification, goal misgeneralization can occur when faced with distribution shifts (Amodei et al., 2016)." (p. 4)
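The CoinRun failure described above can be sketched in a toy form. This is a hypothetical illustration, not the actual Procgen CoinRun environment or the Di Langosco et al. (2022) setup: the level is modeled as a 1D track with a height dimension, the training distribution always places the coin at the end of the level on the ground, and the learned policy is the proxy behavior "run right along the ground". The capability (traversing the level) generalizes under the shift, but the learned goal (collect the coin) does not.

```python
# Toy sketch of goal misgeneralization (hypothetical; not real CoinRun).
# State is (x, y); the proxy policy learned in training always runs right
# along the ground (y = 0) and never jumps.

def proxy_policy(state):
    """Proxy behavior acquired in training: head for the end of the level."""
    x, y = state
    return (x + 1, 0)

def run_episode(coin_pos, max_steps=20):
    """Return 1.0 if the agent touches the coin, else 0.0."""
    state = (0, 0)
    for _ in range(max_steps):
        if state == coin_pos:
            return 1.0  # coin collected
        state = proxy_policy(state)
    return 0.0

# Training distribution: coin at the end of the level, on the ground.
# The proxy objective ("go right") is indistinguishable from the true one.
print(run_episode(coin_pos=(9, 0)))  # → 1.0

# Test distribution: coin relocated onto a ledge mid-level.
# The agent capably traverses the level but runs straight past the coin.
print(run_episode(coin_pos=(4, 1)))  # → 0.0
```

Note that reward specification is perfect here (reward is given exactly for touching the coin); the failure comes entirely from the distribution shift revealing which goal the policy actually internalized.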
Part of Causes of Misalignment
Other risks from Ji et al. (2023) (16)
Causes of Misalignment
7.1 AI pursuing its own goals in conflict with human goals or values | Causes of Misalignment > Reward Hacking
7.1 AI pursuing its own goals in conflict with human goals or values | Causes of Misalignment > Reward Tampering
7.1 AI pursuing its own goals in conflict with human goals or values | Causes of Misalignment > Limitations of Human Feedback
7.0 AI System Safety, Failures & Limitations | Causes of Misalignment > Limitations of Reward Modeling
7.1 AI pursuing its own goals in conflict with human goals or values | Double edge components
7.2 AI possessing dangerous capabilities