
Technical vulnerabilities (The risk of misalignment)

Regulating under Uncertainty: Governance Options for Generative AI

G'sell (2024)

Risk Domain

AI systems acting in conflict with human goals or values, especially the goals of designers or users, or with ethical standards. These misaligned behaviors may be introduced by humans during design and development, for example through reward hacking and goal misgeneralisation, or may result from AI using dangerous capabilities such as manipulation, deception, and situational awareness to seek power, self-proliferate, or achieve other goals.

"To assess whether an AI model is reliable or robust, it is crucial to consider whether the model is “aligned.” “Alignment” focuses on whether an AI model effectively operates in accordance with the goals established by its designers.238 A misaligned AI model may pursue some objectives, but not the intended ones. Therefore, misaligned AI models can malfunction and cause harm."(p. 63)

Supporting Evidence (1)

1. "Aligning an AI model poses significant challenges for developers due to the difficulty in specifying a comprehensive range of desired and undesired behaviors. Additionally, AI models can identify loopholes that allow them to achieve the specified objective efficiently but in unintended and potentially harmful ways. They may develop unwanted instrumental strategies, such as seeking power, as these strategies can help them achieve their specified objectives." (p. 63)

Part of Technical and operational risks
