Limitations of Reward Modeling

AI Alignment: A Comprehensive Survey

Ji et al. (2023)

Sub-category
Risk Domain

AI systems acting in conflict with human goals or values, especially the goals of designers or users, or ethical standards. These misaligned behaviors may be introduced by humans during design and development, such as through reward hacking and goal misgeneralisation, or may result from AI using dangerous capabilities such as manipulation, deception, or situational awareness to seek power, self-proliferate, or achieve other goals.

"Limitations of Reward Modeling. Training reward models using comparison feedback can pose significantchallenges in accurately capturing human values. For example, these models may unconsciously learn suboptimal or incomplete objectives, resulting in reward hacking (Zhuang and Hadfield-Menell, 2020; Skalse et al.,2022). Meanwhile, using a single reward model may struggle to capture and specify the values of a diversehuman society (Casper et al., 2023b)."(p. 4)

Part of Causes of Misalignment
