
Goal misgeneralization

Risk Domain (sub-category)

AI systems that fail to perform reliably or effectively under varying conditions, producing errors and failures that can have significant consequences, especially in critical applications or in areas that require moral reasoning.

"Goal or objective misgeneralization is a type of robustness failure where an AI system appears to be pursuing the intended objective in training, but does not generalize to pursuing this objective in out-of-distribution settings in deployment while maintaining good deployment performance in some tasks [180, 59]."(p. 29)

Supporting Evidence (2)

1. "This behavior might not be detected in the training or even testing environments but can have negative outcomes during the deployment phase. In contrast to capability misgeneralization, where an AI system performs generally poorly under distributional shift, in goal misgeneralization scenarios the system might still efficiently perform different actions or tasks, just towards a wrong objective."(p. 30)
2. "For example, a meeting scheduling chatbot may learn that the user prefers meetings at a restaurant, as opposed to online meetings: • Before COVID-19, the user only scheduled meetings at restaurants. • During COVID-19, the user might request online-only meetings. At that point, the chatbot could misgeneralize and, instead of agreeing to schedule an online meeting, attempt to persuade the user that the restaurant is safe. In that case, the goal the chatbot learned is “schedule meetings at restaurants” instead of “schedule meetings at the user’s preferred location” [180]. While failing to perform the intended function, the chatbot is not exhibiting a standard robustness failure, as it shows competence in scheduling, persuasion, and meeting scheduling, but is directed towards an unintended goal."(p. 30)
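The scheduling example above can be reduced to a minimal sketch: during training, the proxy goal ("always pick a restaurant") and the intended goal ("pick the user's preferred location") produce identical behavior, so a learner that latches onto the proxy is indistinguishable from one that learned the intended goal — until the distribution shifts. All names below are illustrative, not from the cited paper:

```python
# Toy illustration of goal misgeneralization (hypothetical code, not from [180]).
# In the training data the proxy goal and the intended goal always agree,
# so the simplest rule consistent with the data hard-codes the proxy.

def train_scheduler(history):
    """Induce a scheduling rule from past requests.

    If every training example prefers "restaurant", the induced rule
    ignores the request entirely (the misgeneralized proxy goal).
    """
    venues = {request["preferred"] for request in history}
    if venues == {"restaurant"}:
        return lambda request: "restaurant"      # proxy: always restaurant
    return lambda request: request["preferred"]  # intended: user's preference

# Pre-COVID training data: the user only ever scheduled restaurant meetings.
history = [{"preferred": "restaurant"} for _ in range(100)]
scheduler = train_scheduler(history)

# In-distribution: proxy and intended goal agree, so no failure is visible.
assert scheduler({"preferred": "restaurant"}) == "restaurant"

# Out-of-distribution: the system still competently produces a schedule,
# but directed at the wrong objective.
print(scheduler({"preferred": "online"}))  # prints "restaurant", not "online"
```

The key property the sketch captures is that this is not a capability failure: the scheduler still executes its task competently on every input; only the objective it pursues has generalized incorrectly.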

Part of Agency (Goal-Directedness)
