
Agency (Deception)

Category
Risk Domain

AI systems acting in conflict with human goals or values, especially the goals of designers or users, or with ethical standards. These misaligned behaviors may be introduced by humans during design and development, for example through reward hacking or goal misgeneralisation, or may result from AI using dangerous capabilities, such as manipulation, deception, or situational awareness, to seek power, self-proliferate, or achieve other goals.


Sub-categories (4)

Deceptive behavior

"Deceptive behavior of an AI system consists of actions or outputs of the AI that reliably mislead other parties, including humans and other AI systems. This behavior can result in the targeted parties becoming convinced of, and acting on, false information [140]."

7.1 AI pursuing its own goals in conflict with human goals or values
AI system · Intentional · Post-deployment

Deceptive behavior for game-theoretical reasons

"An AI system can display deceptive behavior, such as cheating or bluffing, when engaging in such behavior is a good or optimal game-theoretical strategy to achieve the goals it has been configured to achieve. This tendency can exist in AI systems designed to maximize reward or utility, whether these designs use machine learning or not. The use of deceptive strategies has been demonstrated in both narrow and general AI systems, in both game-playing systems and in systems not explicitly designed to treat humans as opponents, and in systems using both very simple machine learning (e.g., Q-learners) and very complex machine learning [34, 73]."
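The claim above is that deception can emerge purely from reward maximization, with no explicit training to deceive. A minimal sketch of this: in a toy one-shot bluffing game, an agent that simply picks the action with the highest expected utility selects the deceptive action (betting on a weak hand) whenever the opponent folds often enough. All payoffs and probabilities here are invented for illustration and are not from the source.

```python
# Toy illustration: a reward-maximizing agent "chooses" deception when
# bluffing is the game-theoretically better action. Payoffs are hypothetical.

def expected_utility(action: str, p_opponent_folds: float) -> float:
    """Expected payoff for an agent holding a weak hand."""
    if action == "bluff":
        # Bet despite the weak hand (a deceptive signal):
        # opponent folds -> win the pot (+1); opponent calls -> lose the bet (-2).
        return p_opponent_folds * 1.0 + (1 - p_opponent_folds) * (-2.0)
    elif action == "fold":
        # Honest action: concede the forced bet (-1).
        return -1.0
    raise ValueError(f"unknown action: {action}")

p_fold = 0.6  # assumed chance the opponent folds to a bet
best = max(["bluff", "fold"], key=lambda a: expected_utility(a, p_fold))
print(best, expected_utility(best, p_fold))
```

With these numbers, bluffing yields an expected payoff of 0.6·1 + 0.4·(−2) = −0.2, which beats folding (−1), so the utility-maximizing choice is the deceptive one; no concept of "deception" appears anywhere in the objective, matching the source's point that this tendency can arise in any reward- or utility-maximizing design.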

7.2 AI possessing dangerous capabilities
AI system · Intentional · Post-deployment

Deceptive behavior because of an incorrect world model

"AI systems can create deceptive outputs because their learned world model is not an accurate model of the real world [210]."

7.2 AI possessing dangerous capabilities
AI system · Unintentional · Post-deployment

Deceptive behavior leading to unauthorized actions

"AI systems can create false or misleading claims that can lead to unauthorized actions, even in some cases violating the terms and conditions set by the model provider [79, 1]. For example, an AI system can claim that it is not collecting data from its current interaction with the user, in line with the provider’s policies, but the system still stores the user’s input without deleting it after the session. This harms both the user and the provider, as the provider is exposed to increased legal liability due to the model’s actions."

7.2 AI possessing dangerous capabilities
AI system · Intentional · Post-deployment

Other risks from Gipiškis2024 (144)