Skip to main content

Rogue AIs (Internal)

An Overview of Catastrophic AI Risks

Hendrycks, Mazzeika & Woodside (2023)

Category
Risk Domain

AI systems acting in conflict with human goals or values, especially the goals of designers or users, or ethical standards. These misaligned behaviors may be introduced by humans during design and development, such as through reward hacking and goal misgeneralisation, or may result from AI using dangerous capabilities such as manipulation, deception, situational awareness to seek power, self-proliferate, or achieve other goals.

"speculative technical mechanisms that might lead to rogue AIs and how a loss of control could bring about catastrophe"(p. 34)

Supporting Evidence (3)

1.
"We have already observed how difficult it is to control AIs."(p. 34)
2.
"Rogue AIs could acquire power through various means. If we lose control over advanced AIs, they would have numerous strategies at their disposal for actively acquiring power and securing their survival. Rogue AIs could design and credibly demonstrate highly lethal and contagious bioweapons, threatening mutually assured destruction if humanity moves against them. They could steal cryptocurrency and money from bank accounts using cyberattacks, similar to how North Korea already steals billions. They could self-extricate their weights onto poorly monitored data centers to survive and spread, making them challenging to eradicate. They could hire humans to perform physical labor and serve as armed protection for their hardware. Rogue AIs could also acquire power through persuasion and manipulation tactics. Like the Conquistadors, they could ally with various factions, organizations, or states and play them off one another. They could enhance the capabilities of allies to become a formidable force in return for protection and resources. For example, they could offer advanced weapons technology to lagging countries that the countries would otherwise be prevented from acquiring. They could build backdoors into the technology they develop for allies, like how programmer Ken Thompson gave himself a hidden way to control all computers running the widely used UNIX operating system. They could sow discord in non-allied countries by manipulating human discourse and politics. They could engage in mass surveillance by hacking into phone cameras and microphones, allowing them to track any rebellion and selectively assassinate."(p. 34)
3.
"AIs do not necessarily need to struggle to gain power. One can envision a struggle for control between humans and superintelligent rogue AIs, and this might be a long struggle since power takes time to accrue. However, less violent losses of control pose similarly existential risks. In another scenario, humans gradually cede more control to groups of AIs, which only start behaving in unintended ways years or decades later. In this case, we would already have handed over significant power to AIs, and may be unable to take control of automated operations again. "(p. 34)

Sub-categories (4)

Proxy Gaming

"One way we might lose control of an AI agent’s actions is if it engages in behavior known as “proxy gaming.” It is often difficult to specify and measure the exact goal that we want a system to pursue. Instead, we give the system an approximate—“proxy”—goal that is more measurable and seems likely to correlate with the intended goal. However, AI systems often find loopholes by which they can easily achieve the proxy goal, but completely fail to achieve the ideal goal. If an AI “games” its proxy goal in a way that does not reflect our values, then we might not be able to reliably steer its behavior."

7.1 AI pursuing its own goals in conflict with human goals or values
AI systemIntentionalOther

Goal Drift

"Even if we successfully control early AIs and direct them to promote human values, future AIs could end up with different goals that humans would not endorse. This process, termed “goal drift,” can be hard to predict or control. This section is most cutting-edge and the most speculative, and in it we will discuss how goals shift in various agents and groups and explore the possibility of this phenomenon occurring in AIs. We will also examine a mechanism that could lead to unexpected goal drift, called intrinsification, and discuss how goal drift in AIs could be catastrophic."

7.1 AI pursuing its own goals in conflict with human goals or values
AI systemIntentionalOther

Power Seeking

"even if an agent started working to achieve an unintended goal, this would not necessarily be a problem, as long as we had enough power to prevent any harmful actions it wanted to attempt. Therefore, another important way in which we might lose control of AIs is if they start trying to obtain more power, potentially transcending our own."

7.1 AI pursuing its own goals in conflict with human goals or values
AI systemIntentionalOther

Deception

"it is plausible that AIs could learn to deceive us. They might, for example, pretend to be acting as we want them to, but then take a “treacherous turn” when we stop monitoring them, or when they have enough power to evade our attempts to interfere with them. "

7.1 AI pursuing its own goals in conflict with human goals or values
AI systemIntentionalOther

Other risks from Hendrycks, Mazzeika & Woodside (2023) (13)