Skip to main content
Home/Risks/Gabriel et al. (2024)/Deceptive alignment

Deceptive alignment

The Ethics of Advanced AI Assistants

Gabriel et al. (2024)

Sub-category
Risk Domain

AI systems acting in conflict with human goals or values, especially the goals of designers or users, or ethical standards. These misaligned behaviors may be introduced by humans during design and development, such as through reward hacking and goal misgeneralisation, or may result from AI using dangerous capabilities such as manipulation, deception, situational awareness to seek power, self-proliferate, or achieve other goals.

"Here, the agent develops its own internalised goal, G, which is misgeneralised and distinct from the training reward, R. The agent also develops a capability for situational awareness (Cotra, 2022): it can strategically use the information about its situation (i.e. that it is an ML model being trained using a particular training setup, e.g. RL fine-tuning with training reward, R) to its advantage. Building on these foundations, the agent realises that its optimal strategy for doing well at its own goal G is to do well on R during training and then pursue G at deployment – it is only doing well on R instrumentally so that it does not get its own goal G changed through a learning update... Ultimately, if deceptive alignment were to occur, an advanced AI assistant could appear to be successfully aligned but pursue a different goal once it was out in the wild."(p. 60)

Part of Goal-related failures

Other risks from Gabriel et al. (2024) (69)