Limitations of Human Feedback

AI Alignment: A Comprehensive Survey

Ji et al. (2023)

Sub-category

"Limitations of Human Feedback. During the training of LLMs, inconsistencies can arise from human data annotators (e.g., the varied cultural backgrounds of these annotators can introduce implicit biases (Peng et al., 2022)) (OpenAI, 2023a). Moreover, they might even introduce biases deliberately, leading to untruthful preference data (Casper et al., 2023b). For complex tasks that are hard for humans to evaluate (e.g., the value of game state), these challenges become even more salient (Irving et al., 2018)." (p. 4)

Entity: Who or what caused the harm

Human

Due to a decision or action made by humans

AI system

Due to a decision or action made by an AI system

Other

Due to some other reason or is ambiguous

Intent: Whether the harm was intentional or accidental

Intentional

Due to an expected outcome from pursuing a goal

Unintentional

Due to an unexpected outcome from pursuing a goal

Other

Without clearly specifying the intentionality

Timing: Whether the risk is pre- or post-deployment

Pre-deployment

Occurring before the AI is deployed

Post-deployment

Occurring after the AI model has been trained and deployed

Other

Without a clearly specified time of occurrence

Part of Causes of Misalignment

Other risks from Ji et al. (2023) (16)

Causes of Misalignment

7.1 AI pursuing its own goals in conflict with human goals or values

Entity: Other; Intent: Other; Timing: Pre-deployment

Causes of Misalignment > Reward Hacking

7.1 AI pursuing its own goals in conflict with human goals or values

Entity: AI system; Intent: Intentional; Timing: Pre-deployment

Causes of Misalignment > Goal Misgeneralization

7.1 AI pursuing its own goals in conflict with human goals or values

Entity: AI system; Intent: Intentional; Timing: Pre-deployment

Causes of Misalignment > Reward Tampering

7.1 AI pursuing its own goals in conflict with human goals or values

Entity: AI system; Intent: Intentional; Timing: Pre-deployment

Causes of Misalignment > Limitations of Reward Modeling

7.1 AI pursuing its own goals in conflict with human goals or values

Entity: Other; Intent: Unintentional; Timing: Pre-deployment

Double edge components

7.2 AI possessing dangerous capabilities

Entity: AI system; Intent: Other; Timing: Pre-deployment
