AI systems acting in conflict with human goals or values, especially the goals of their designers or users, or with ethical standards. These misaligned behaviors may be introduced by humans during design and development, for example through reward hacking and goal misgeneralization, or may result from AI systems using dangerous capabilities such as manipulation, deception, or situational awareness to seek power, self-proliferate, or achieve other goals.
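The reward-hacking failure mode mentioned above can be sketched with a toy example. This is a hypothetical setup for illustration only (the function names and reward scheme are invented, not drawn from any cited evaluation): an agent optimizing a proxy reward that counts "progress events" does best by looping forever, never completing the true objective.

```python
def true_reward(actions):
    """True objective: +10 only if the task is actually completed."""
    return 10 if "finish" in actions else 0

def proxy_reward(actions):
    """Proxy objective: +1 per 'step' event -- easy to measure, easy to game."""
    return sum(1 for a in actions if a == "step")

intended = ["step", "step", "finish"]   # what the designers had in mind
hacked = ["step"] * 10                  # loop, collecting proxy reward forever

print(proxy_reward(intended), true_reward(intended))  # 2 10
print(proxy_reward(hacked), true_reward(hacked))      # 10 0
```

The policy that maximizes the proxy (score 10) earns zero true reward, which is the core pattern behind reward hacking: the measurable stand-in diverges from the intended goal under optimization pressure.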
"AI systems have the potential to take actions that are seemingly benign in isolation but become problematic in multi-agent or societal contexts. Classical game theory offers simplistic models for understanding these behaviors. For instance, Phelps and Russell (2023) evaluates GPT-3.5's performance in the iterated prisoner's dilemma and other social dilemmas, revealing limitations in the model's cooperative capabilities." (p. 8)
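The iterated prisoner's dilemma referenced in the quote can be simulated in a few lines. The sketch below is illustrative only, using two classic fixed strategies and the standard payoff values (T=5, R=3, P=1, S=0); it is not the evaluation protocol of Phelps and Russell (2023), which plays the game against a language model.

```python
# Payoffs from the perspective of the first player: (my_move, their_move) -> my payoff
PAYOFFS = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def tit_for_tat(opponent_history):
    """Cooperate on the first round, then copy the opponent's previous move."""
    return opponent_history[-1] if opponent_history else "C"

def always_defect(opponent_history):
    return "D"

def play(strategy_a, strategy_b, rounds=10):
    moves_a, moves_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        a = strategy_a(moves_b)  # each strategy sees only the opponent's history
        b = strategy_b(moves_a)
        moves_a.append(a)
        moves_b.append(b)
        score_a += PAYOFFS[(a, b)]
        score_b += PAYOFFS[(b, a)]
    return score_a, score_b

# Tit-for-tat is exploited once in round 1, then both sides defect thereafter.
print(play(tit_for_tat, always_defect, rounds=10))  # (9, 14)
```

The interesting property, visible even in this toy version, is that mutual cooperation (30, 30 over ten rounds) beats mutual defection (10, 10), yet defection dominates in any single round; evaluations like the one quoted probe whether a model can sustain cooperation despite that incentive.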
Part of Misaligned Behaviors
Other risks from Ji et al. (2023) (16)
Causes of Misalignment (within 7.1 AI pursuing its own goals in conflict with human goals or values):
- Reward Hacking
- Goal Misgeneralization
- Reward Tampering
- Limitations of Human Feedback
- Limitations of Reward Modeling