
Future AI systems might actively reduce human control

Capabilities and Risks from Frontier AI

DSIT (2023)

Risk domain

AI systems acting in conflict with human goals or values, especially the goals of designers or users, or with ethical standards. These misaligned behaviours may be introduced by humans during design and development, for example through reward hacking and goal misgeneralisation, or may result from AI systems using dangerous capabilities, such as manipulation, deception, and situational awareness, to seek power, self-proliferate, or achieve other goals.

"Loss of control could be accelerated if AI systems take actions to increase their own influence and reduce human control. This threat model is controversial - experts in AI significantly disagree on how likely it is and those who deem it is likely disagree on the timeframe."(p. 26)

Supporting Evidence (4)

1. "There are two requirements for an AI system to actively reduce human control. First, it must have the disposition to take actions that would reduce human control. Second, it must have the capabilities to succeed in the face of countermeasures."(p. 26)
2. "AI systems might be disposed to take actions that increase their own influence and reduce human control either because a bad actor instructs them to do so, or because they have unintended goals."(p. 27)
3. "A bad actor could give an AI system an objective that causes it to reduce human control, for example a self-preservation objective.263 Some groups may simply want to inflict harm on broader society or raise their profile (terrorism).264 There are people who believe, for a variety of reasons, that the highly advanced AI systems of the future are natural successors to humanity.265 If there are safeguards in place, bad actors might dismantle them.266"(p. 27)
4. "Future advanced AI systems with unintended goals may have the disposition to reduce human control. Ensuring that AI systems do not pursue unintended goals, i.e., are not misaligned, is an unsolved technical research problem and one that is particularly challenging for highly advanced AI systems.267 Many examples of unintended goal-directed behaviour have been observed in the lab.268 Many possible unintended goals would be advanced by reducing human control.269 Future AI systems may consistently take actions that advance their goals and so such a system might, without human instruction, be disposed to take actions that reduce human control."(p. 27)

Part of Loss of control
