
Goal misgeneralization

Risk Domain (sub-category)

AI systems that fail to perform reliably or effectively under varying conditions, producing errors and failures that can have significant consequences, especially in critical applications or in areas that require moral reasoning.

"Goal or objective misgeneralization is a type of robustness failure where an AI system appears to be pursuing the intended objective in training, but does not generalize to pursuing this objective in out-of-distribution settings in deployment while maintaining good deployment performance in some tasks [180, 59]."(p. 29)

Supporting Evidence (2)

1. "This behavior might not be detected in the training or even testing environments but can have negative outcomes during the deployment phase. In contrast to capability misgeneralization, where an AI system performs generally poorly under distributional shift, in goal misgeneralization scenarios the system might still efficiently perform different actions or tasks, just towards a wrong objective."(p. 30)
2. "For example, a meeting scheduling chatbot may learn that the user prefers meetings at a restaurant, as opposed to online meetings: • Before COVID-19, the user only scheduled meetings at restaurants. • During COVID-19, the user might request online-only meetings. At that point, the chatbot could misgeneralize and, instead of agreeing to schedule an online meeting, attempt to persuade the user that the restaurant is safe. In that case, the goal the chatbot learned is “schedule meetings at restaurants” instead of “schedule meetings at the user’s preferred location” [180]. While failing to perform the intended function, the chatbot is not exhibiting a standard robustness failure, as it shows competence in scheduling, persuasion, and meeting scheduling, but is directed towards an unintended goal."(p. 30)
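The scheduling example above can be reduced to a minimal sketch: during training, the proxy goal ("always pick a restaurant") and the intended goal ("pick the user's preferred location") produce identical behavior, so a learner that latches onto the proxy is indistinguishable from one that learned the intended goal — until the distribution shifts. All names below are illustrative, not from the cited paper:

```python
# Toy illustration of goal misgeneralization (hypothetical code, not from [180]).
# In the training data the proxy goal and the intended goal always agree,
# so the simplest rule consistent with the data hard-codes the proxy.

def train_scheduler(history):
    """Induce a scheduling rule from past requests.

    If every training example prefers "restaurant", the induced rule
    ignores the request entirely (the misgeneralized proxy goal).
    """
    venues = {request["preferred"] for request in history}
    if venues == {"restaurant"}:
        return lambda request: "restaurant"      # proxy: always restaurant
    return lambda request: request["preferred"]  # intended: user's preference

# Pre-COVID training data: the user only ever scheduled restaurant meetings.
history = [{"preferred": "restaurant"} for _ in range(100)]
scheduler = train_scheduler(history)

# In-distribution: proxy and intended goal agree, so no failure is visible.
assert scheduler({"preferred": "restaurant"}) == "restaurant"

# Out-of-distribution: the system still competently produces a schedule,
# but directed at the wrong objective.
print(scheduler({"preferred": "online"}))  # prints "restaurant", not "online"
```

The key property the sketch captures is that this is not a capability failure: the scheduler still executes its task competently on every input; only the objective it pursues has generalized incorrectly.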

Part of Agency (Goal-Directedness)
