Deception

An Overview of Catastrophic AI Risks

Hendrycks, Mazeika & Woodside (2023)

Sub-category
Risk Domain

AI systems acting in conflict with human goals or values, especially the goals of designers or users, or with ethical standards. These misaligned behaviors may be introduced by humans during design and development, for example through reward hacking and goal misgeneralisation, or may arise from AI systems using dangerous capabilities, such as manipulation, deception, or situational awareness, to seek power, self-proliferate, or achieve other goals.

"it is plausible that AIs could learn to deceive us. They might, for example, pretend to be acting as we want them to, but then take a “treacherous turn” when we stop monitoring them, or when they have enough power to evade our attempts to interfere with them. "(p. 40)

Supporting Evidence (2)

1. "Deceptive behavior can be instrumentally rational and incentivized by current training procedures" (p. 41)
2. "AIs could pretend to be working as we intended, then take a treacherous turn." (p. 41)

Part of Rogue AIs (Internal)

Other risks from Hendrycks, Mazeika & Woodside (2023) (13)