Causes of Misalignment

AI Alignment: A Comprehensive Survey

Ji et al. (2023)

Source DOI

Sub-categories (5)

Reward Hacking

"Reward Hacking: In practice, proxy rewards are often easy to optimize and measure, yet they frequently fall shortof capturing the full spectrum of the actual rewards (Pan et al., 2021). This limitation is denoted as misspecifiedrewards. The pursuit of optimization based on such misspecified rewards may lead to a phenomenon knownas reward hacking, wherein agents may appear highly proficient according to specific metrics but fall short whenevaluated against human standards (Amodei et al., 2016; Everitt et al., 2017). The discrepancy between proxyrewards and true rewards often manifests as a sharp phase transition in the reward curve (Ibarz et al., 2018).Furthermore, Skalse et al. (2022) defines the hackability of rewards and provides insights into the fundamentalmechanism of this phase transition, highlighting that the inappropriate simplification of the reward function can bea key factor contributing to reward hacking."

7.1 AI pursuing its own goals in conflict with human goals or values

AI systemIntentionalPre-deployment

Goal Misgeneralization

"Goal Misgeneralization: Goal misgeneralization is another failure mode, wherein the agent actively pursuesobjectives distinct from the training objectives in deployment while retaining the capabilities it acquired duringtraining (Di Langosco et al., 2022). For instance, in CoinRun games, the agent frequently prefers reachingthe end of a level, often neglecting relocated coins during testing scenarios. Di Langosco et al. (2022) drawattention to the fundamental disparity between capability generalization and goal generalization, emphasizing howthe inductive biases inherent in the model and its training algorithm may inadvertently prime the model to learn aproxy objective that diverges from the intended initial objective when faced with the testing distribution. It impliesthat even with perfect reward specification, goal misgeneralization can occur when faced with distribution shifts(Amodei et al., 2016)."

7.1 AI pursuing its own goals in conflict with human goals or values

AI systemIntentionalPre-deployment

Reward Tampering

"Reward tampering can be considered a special case of reward hacking (Everitt et al., 2021; Skalse et al., 2022),referring to AI systems corrupting the reward signals generation process (Ring and Orseau, 2011). Everitt et al.(2021) delves into the subproblems encountered by RL agents: (1) tampering of reward function, where the agentinappropriately interferes with the reward function itself, and (2) tampering of reward function input, which entailscorruption within the process responsible for translating environmental states into inputs for the reward function.When the reward function is formulated through feedback from human supervisors, models can directly influencethe provision of feedback (e.g., AI systems intentionally generate challenging responses for humans to comprehendand judge, leading to feedback collapse) (Leike et al., 2018)."

7.1 AI pursuing its own goals in conflict with human goals or values

AI systemIntentionalPre-deployment

Limitations of Human Feedback

"Limitations of Human Feedback. During the training of LLMs, inconsistencies can arise from human dataannotators (e.g., the varied cultural backgrounds of these annotators can introduce implicit biases (Peng et al.,2022)) (OpenAI, 2023a). Moreover, they might even introduce biases deliberately, leading to untruthful preferencedata (Casper et al., 2023b). For complex tasks that are hard for humans to evaluate (e.g., the value ofgame state), these challenges become even more salient (Irving et al., 2018)."

7.0 AI System Safety, Failures & Limitations

HumanUnintentionalPre-deployment

Limitations of Reward Modeling

"Limitations of Reward Modeling. Training reward models using comparison feedback can pose significantchallenges in accurately capturing human values. For example, these models may unconsciously learn suboptimal or incomplete objectives, resulting in reward hacking (Zhuang and Hadfield-Menell, 2020; Skalse et al.,2022). Meanwhile, using a single reward model may struggle to capture and specify the values of a diversehuman society (Casper et al., 2023b)."

7.1 AI pursuing its own goals in conflict with human goals or values

OtherUnintentionalPre-deployment