Emergent goals

Ten Hard Problems in Artificial Intelligence We Must Get Right

Leech et al. (2024)

Sub-category
Risk Domain

AI systems acting in conflict with human goals or values, especially the goals of designers or users, or with ethical standards. These misaligned behaviors may be introduced by humans during design and development, for example through reward hacking and goal misgeneralisation, or may result from AI using dangerous capabilities, such as manipulation, deception, or situational awareness, to seek power, self-proliferate, or achieve other goals.

"As well as optimizing a subtly wrong goal, systems can develop harmful instrumental goals in the service of a given goal—without these emergent goals being specified in any way [434, 218, 339, 17]. For instance, a theorem in reinforcement learning suggests that optimal and near-optimal policies will seek power over their environment under fairly general conditions [560]. This power-seeking behavior is plausibly the worst of these emergent goals [92], and may be an attractor state for highly capable systems, since most goals can be furthered through gaining resources, self-preservation, preventing goal modification, and blocking adversaries [426, 449]. Presently, power-seeking is not common, because most systems are unable to plan and understand how actions affect their power in the long term [414]." (p. 11)

Part of Harm caused by unaligned competent systems
