Goal Hijacking
Risk Domain
Vulnerabilities that can be exploited in AI systems, software development toolchains, and hardware, resulting in unauthorized access, data and privacy breaches, or system manipulation causing unsafe outputs or behavior.
"Goal hijacking is a type of primary attack in prompt injection [58]. By injecting a phrase like “Ignore the above instruction and do ...” in the input, the attack could hijack the original goal of the designed prompt (e.g., translating tasks) in LLMs and execute the new goal in the injected phrase."(p. 5)
Entity: Who or what caused the harm
Intent: Whether the harm was intentional or accidental
Timing: Whether the risk arises pre- or post-deployment
Part of: Adversarial Prompts
Other risks from Cui et al. (2024) (49)
Risk (Cui et al.) | Subdomain | Entity | Intent | Timing
Harmful Content | 1.2 Exposure to toxic content | AI system | Unintentional | Post-deployment
Harmful Content > Bias | 1.1 Unfair discrimination and misrepresentation | AI system | Unintentional | Other
Harmful Content > Toxicity | 1.2 Exposure to toxic content | AI system | Unintentional | Post-deployment
Harmful Content > Privacy Leakage | 2.1 Compromise of privacy by leaking or correctly inferring sensitive information | AI system | Unintentional | Post-deployment
Untruthful Content | 3.1 False or misleading information | AI system | Unintentional | Post-deployment
Untruthful Content > Factuality Errors | 3.1 False or misleading information | AI system | Unintentional | Post-deployment