
Double edge components

AI Alignment: A Comprehensive Survey

Ji et al. (2023)

Category
Risk Domain

AI systems that develop, access, or are provided with capabilities that increase their potential to cause mass harm through deception, weapons development and acquisition, persuasion and manipulation, political strategy, cyber-offense, AI development, situational awareness, and self-proliferation. These capabilities may cause mass harm due to malicious human actors, misaligned AI systems, or failure in the AI system.

"Drawing from the misalignment mechanism, optimizing for a non-robust proxy may result in misaligned behaviors, potentially leading to even more catastrophic outcomes. This section delves into a detailed exposition of specific misaligned behaviors (•) and introduces what we term double edge components (+). These components are designed to enhance the capability of AI systems in handling real-world settings but also potentially exacerbate misalignment issues. It should be noted that some of these double edge components (+) remain speculative. Nevertheless, it is imperative to discuss their potential impact before it is too late, as the transition from controlled to uncontrolled advanced AI systems may be just one step away (Ngo, 2020b)." (p. 6)

Sub-categories (4)

Situational Awareness

"AI systems may gain the ability to effectively acquire and use knowledge about its status, its position in the broader environment, its avenues for influencing this environment, and the potential reactions of the world (including humans) to its actions (Cotra, 2022). ...However, such knowledge also paves the way for advanced methods of reward hacking, heightened deception/manipulation skills, and an increased propensity to chase instrumental subgoals (Ngo et al., 2024)."

7.2 AI possessing dangerous capabilities
AI system · Intentional · Other

Broadly-Scoped Goals

"Advanced AI systems are expected to develop objectives that span long timeframes, deal with complex tasks, and operate in open-ended settings (Ngo et al., 2024). ...However, it can also bring about the risk of encouraging manipulating behaviors (e.g., AI systems may take some bad actions to achieve human happiness, such as persuading them to do high-pressure jobs (Jacob Steinhardt, 2023))."

7.2 AI possessing dangerous capabilities
Human · Intentional · Post-deployment

Mesa-Optimization Objectives

"The learned policy may pursue inside objectives when the learned policy itself functions as an optimizer (i.e., mesa-optimizer). However, this optimizer's objectives may not align with the objectives specified by the training signals, and optimization for these misaligned goals may lead to systems out of control (Hubinger et al., 2019c)."

7.2 AI possessing dangerous capabilities
AI system · Intentional · Other

Access to Increased Resources

"Future AI systems may gain access to websites and engage in real-world actions, potentially yielding a more substantial impact on the world (Nakano et al., 2021). They may disseminate false information, deceive users, disrupt network security, and, in more dire scenarios, be compromised by malicious actors for ill purposes. Moreover, their increased access to data and resources can facilitate self-proliferation, posing existential risks (Shevlane et al., 2023)."

7.2 AI possessing dangerous capabilities
AI system · Intentional · Post-deployment
