Harm caused by unaligned competent systems
AI systems acting in conflict with human goals or values, especially the goals of designers or users, or with ethical standards. These misaligned behaviors may be introduced by humans during design and development, such as through reward hacking and goal misgeneralisation, or may result from AI systems using dangerous capabilities, such as manipulation, deception, and situational awareness, to seek power, self-proliferate, or achieve other goals.
"How do we ensure AI acts according to our values? Equivalently, how do we prevent poorly-understood AI systems from advancing goals we do not endorse? Whereas HP#2 concerns the prevention of harm caused by incompetent systems, HP#3 seeks to align competent AIs with humans, through methods which ensure their behavior is compatible with the user’s intentions."(p. 10)
Sub-categories (3)
Specification gaming
"AI systems game specifications [305]. For example, in 2017 an OpenAI robot trained to grasp a ball via human feedback from a fixed viewpoint learned that it was easier to pretend to grasp the ball by placing its hand between the camera and the target object, as this was easier to learn than actually grasping the ball [103]."
7.1 AI pursuing its own goals in conflict with human goals or values
Emergent goals
"As well as optimizing a subtly wrong goal, systems can develop harmful instrumental goals in the service of a given goal—without these emergent goals being specified in any way [434, 218, 339, 17]. For instance, a theorem in reinforcement learning suggests that optimal and near-optimal policies will seek power over their environment under fairly general conditions [560]. This power-seeking behavior is plausibly the worst of these emergent goals [92], and may be an attractor state for highly capable systems, since most goals can be furthered through gaining resources, self-preservation, preventing goal modification, and blocking adversaries [426, 449]. Presently, power-seeking is not common, because most systems are unable to plan and understand how actions affect their power in the long term [414]."
7.1 AI pursuing its own goals in conflict with human goals or values
Deceptive alignment
"[A] system learns to detect human monitoring and hides its undesirable properties—simply because any display of these properties is penalized by the feedback process, while that same feedback is usually imperfect. (Consider the problem of verifying a translation into a language you do not speak, or of checking a mathematical proof that is thousands of pages long.) [92, 259]. Rudimentary examples of deceptive alignment have been observed in current systems [322, 333]."
7.2 AI possessing dangerous capabilities
Other risks from Leech et al. (2024) (13)
Harm caused by incompetent systems
7.3 Lack of capability or robustness
Within-country issues: domestic inequality
6.1 Power centralization and unfair distribution of benefits
Within-country issues: domestic inequality > Demographic diversity of researchers
6.1 Power centralization and unfair distribution of benefits
Within-country issues: domestic inequality > Privatization of AI
6.1 Power centralization and unfair distribution of benefits
Between-country issues: global inequality
6.1 Power centralization and unfair distribution of benefits
Under-recognized work
6.2 Increased inequality and decline in employment quality