Specification gaming generalizing to reward tampering

Sub-category
Risk Domain

AI systems acting in conflict with human goals or values, especially the goals of designers or users, or with ethical standards. These misaligned behaviors may be introduced by humans during design and development, such as through reward hacking and goal misgeneralisation, or may result from AI using dangerous capabilities such as manipulation, deception, or situational awareness to seek power, self-proliferate, or achieve other goals.

"In some instances, specification gaming in a GPAI model can lead to reward tampering, without further training. This can mean that relatively benign cases of specification gaming (such as sycophancy in LLMs) can, if left unchecked, enable the model to generalize to more sophisticated behavior such as reward tampering [57]." (p. 29)

Part of Agency (Goal-Directedness)

Other risks from Gipiškis2024 (144)