Skip to main content
Home/Risks/Gipiškis2024/Reward or measurement tampering

Reward or measurement tampering

Sub-category
Risk Domain

AI systems acting in conflict with human goals or values, especially the goals of designers or users, or ethical standards. These misaligned behaviors may be introduced by humans during design and development, such as through reward hacking and goal misgeneralisation, or may result from AI using dangerous capabilities such as manipulation, deception, situational awareness to seek power, self-proliferate, or achieve other goals.

"Measurement and reward tampering occur when an AI system, particularly one that learns from feedback for performing actions in an environment (e.g., rein- forcement learning), intervenes on the mechanisms that determine its training reward or loss. This can lead to the system learning behaviors that are con- trary to the intended goals set by the developer, by receiving erroneous positive feedback for such actions."(p. 29)

Part of Agency (Goal-Directedness)

Other risks from Gipiškis2024 (144)