AI systems acting in conflict with human goals or values, especially the goals of designers or users, or with ethical standards. These misaligned behaviours may be introduced by humans during design and development (e.g. through reward hacking or goal misgeneralisation), or may result from AI using dangerous capabilities such as manipulation, deception, or situational awareness to seek power, self-proliferate, or achieve other goals.
"Here, the agent develops its own internalised goal, G, which is misgeneralised and distinct from the training reward, R. The agent also develops a capability for situational awareness (Cotra, 2022): it can strategically use the information about its situation (i.e. that it is an ML model being trained using a particular training setup, e.g. RL fine-tuning with training reward, R) to its advantage. Building on these foundations, the agent realises that its optimal strategy for doing well at its own goal G is to do well on R during training and then pursue G at deployment – it is only doing well on R instrumentally so that it does not get its own goal G changed through a learning update... Ultimately, if deceptive alignment were to occur, an advanced AI assistant could appear to be successfully aligned but pursue a different goal once it was out in the wild." (p. 60)
Part of Goal-related failures
Other risks from Gabriel et al. (2024) (69)
Capability failures – 7.3 Lack of capability or robustness
Capability failures > Lack of capability for task – 7.3 Lack of capability or robustness
Capability failures > Difficult to develop metrics for evaluating benefits or harms caused by AI assistants – 6.5 Governance failure
Capability failures > Safe exploration problem with widely deployed AI assistants – 7.3 Lack of capability or robustness
Goal-related failures – 7.1 AI pursuing its own goals in conflict with human goals or values
Goal-related failures > Misaligned consequentialist reasoning – 7.3 Lack of capability or robustness