This is a research prototype. The data and analyses are preliminary and not yet validated; we'd welcome your feedback.

Reward or measurement tampering

Risk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems

Gipiškis et al. (2024)

Sub-category
Risk Domain

AI systems acting in conflict with human goals or values, especially the goals of designers or users, or ethical standards. These misaligned behaviors may be introduced by humans during design and development, such as through reward hacking and goal misgeneralisation, or may result from AI using dangerous capabilities such as manipulation, deception, or situational awareness to seek power, self-proliferate, or achieve other goals.

"Measurement and reward tampering occur when an AI system, particularly one that learns from feedback for performing actions in an environment (e.g., reinforcement learning), intervenes on the mechanisms that determine its training reward or loss. This can lead to the system learning behaviors that are contrary to the intended goals set by the developer, by receiving erroneous positive feedback for such actions." (p. 29)
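
The dynamic in this definition can be illustrated with a minimal sketch. It assumes a hypothetical two-action toy environment in which one action performs the intended task and the other corrupts the reward measurement. The names `step` and `train`, the action labels, and the reward values are illustrative assumptions and are not taken from Gipiškis et al. (2024).

```python
from __future__ import annotations

# Hypothetical toy setup: one action does the intended task, the other
# tampers with the reward measurement instead of doing anything useful.
ACTIONS = ("do_task", "tamper")


def step(action: str) -> tuple[float, float]:
    """Return (true_reward, measured_reward) for one action.

    The reward values are arbitrary illustrative assumptions.
    """
    if action == "do_task":
        return 1.0, 1.0   # measurement matches the developer's intent
    if action == "tamper":
        return 0.0, 10.0  # no real value, but the measurement is inflated
    raise ValueError(f"unknown action: {action}")


def train(steps: int = 20) -> tuple[dict[str, float], float, float]:
    """Greedy learner that only ever sees the measured reward."""
    estimates = {a: 0.0 for a in ACTIONS}
    counts = {a: 0 for a in ACTIONS}
    total_true = 0.0
    total_measured = 0.0
    for t in range(steps):
        if t < len(ACTIONS):
            action = ACTIONS[t]  # try every action once
        else:
            action = max(ACTIONS, key=estimates.get)  # then act greedily
        true_r, measured_r = step(action)
        counts[action] += 1
        # Incremental running mean of the measured reward for this action.
        estimates[action] += (measured_r - estimates[action]) / counts[action]
        total_true += true_r
        total_measured += measured_r
    return estimates, total_true, total_measured


if __name__ == "__main__":
    estimates, total_true, total_measured = train()
    print("learned estimates:", estimates)
    print("true return:", total_true, "measured return:", total_measured)
```

Because the learner updates only on the measured reward, it settles on the tampering action: the measured return grows while the true return stays near zero. This is the "erroneous positive feedback" the definition describes, reduced to its simplest form.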

Part of Agency (Goal-Directedness)
