Novel Attacks on LLMs
Risk Domain
Vulnerabilities that can be exploited in AI systems, software development toolchains, and hardware, resulting in unauthorized access, data and privacy breaches, or system manipulation causing unsafe outputs or behavior.
Examples given (p. 9):
- Prompt Abstraction Attacks [147]: abstracting queries so they incur lower prices when using an LLM's API.
- Reward Model Backdoor Attacks [148]: constructing backdoor triggers in an LLM's RLHF process.
- LLM-based Adversarial Attacks [149]: exploiting LLMs to construct samples for attacks on other models.
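The prompt-abstraction idea in [147] can be sketched as follows. This is a hypothetical illustration only: the word-based token estimate, the price figure, and the `abstract_query` heuristic are assumptions for the sketch, not the method from the cited paper.

```python
def estimate_cost(prompt: str, price_per_1k_tokens: float = 0.01) -> float:
    """Crude API cost estimate, assuming roughly one token per word
    (an illustrative simplification, not a real tokenizer)."""
    tokens = len(prompt.split())
    return tokens / 1000 * price_per_1k_tokens


def abstract_query(prompt: str, keep: int = 12) -> str:
    """Toy 'abstraction': keep only the first few content words.
    This stands in for a semantic summarization step that preserves
    the query's intent while shrinking its token count."""
    content_words = [w for w in prompt.split() if len(w) > 3]
    return " ".join(content_words[:keep])


long_prompt = (
    "Please provide a detailed, step-by-step explanation of how "
    "transformer attention works, including the role of queries, "
    "keys, values, and softmax normalization, with worked examples."
)
short_prompt = abstract_query(long_prompt)

# The abstracted query is billed for fewer tokens than the original.
assert estimate_cost(short_prompt) < estimate_cost(long_prompt)
```

The point of the sketch is only the economics: an attacker sends a compressed stand-in for the real query, paying for fewer tokens while still eliciting a usable answer.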
Entity: who or what caused the harm
Intent: whether the harm was intentional or accidental
Timing: whether the risk arises pre- or post-deployment
Part of Model Attacks
Other risks from Cui et al. (2024) (49)
Harmful Content
  1.2 Exposure to toxic content (Entity: AI system; Intent: Unintentional; Timing: Post-deployment)
Harmful Content > Bias
  1.1 Unfair discrimination and misrepresentation (Entity: AI system; Intent: Unintentional; Timing: Other)
Harmful Content > Toxicity
  1.2 Exposure to toxic content (Entity: AI system; Intent: Unintentional; Timing: Post-deployment)
Harmful Content > Privacy Leakage
  2.1 Compromise of privacy by leaking or correctly inferring sensitive information (Entity: AI system; Intent: Unintentional; Timing: Post-deployment)
Untruthful Content
  3.1 False or misleading information (Entity: AI system; Intent: Unintentional; Timing: Post-deployment)
Untruthful Content > Factuality Errors
  3.1 False or misleading information (Entity: AI system; Intent: Unintentional; Timing: Post-deployment)