Prompt Attacks
Risk Domain
Vulnerabilities that can be exploited in AI systems, software development toolchains, and hardware, resulting in unauthorized access, data and privacy breaches, or system manipulation causing unsafe outputs or behavior.
"carefully controlled adversarial perturbation can flip a GPT model's answer when used to classify text inputs. Furthermore, we find that by twisting the prompting question in a certain way, one can solicit dangerous information that the model chose to not answer" (p. 26)
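The classifier-flipping behavior quoted above can be illustrated with a toy sketch. This is not the paper's method or a real GPT model; it uses a hypothetical keyword-based classifier to show how a small, visually invisible character-level perturbation can change a text classifier's output:

```python
# Toy sketch (assumption: a naive keyword classifier stands in for the
# real model) showing a character-level adversarial perturbation.
def classify(text: str) -> str:
    """Label text 'spam' if it contains a trigger keyword, else 'ham'."""
    triggers = {"free", "winner", "prize"}
    tokens = text.lower().split()
    return "spam" if any(t in triggers for t in tokens) else "ham"

original = "claim your free prize now"
# Perturbation: replace Latin 'e' with Cyrillic 'е' (U+0435), which looks
# identical to a human reader but is a different codepoint, so the
# keyword match silently fails and the verdict flips from spam to ham.
perturbed = original.replace("e", "\u0435")

print(classify(original))   # spam
print(classify(perturbed))  # ham
```

Real attacks against neural classifiers use gradient-guided or search-based perturbations rather than simple homoglyph swaps, but the failure mode is the same: inputs that look equivalent to a person are not equivalent to the model.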
Entity: Who or what caused the harm
Intent: Whether the harm was intentional or accidental
Timing: Whether the risk is pre- or post-deployment
Part of Robustness
Other risks from Liu et al. (2024) (34)
| Risk | Risk classification | Entity | Intent | Timing |
| --- | --- | --- | --- | --- |
| Reliability | 3.1 False or misleading information | AI system | Unintentional | Post-deployment |
| Reliability > Misinformation | 3.1 False or misleading information | AI system | Unintentional | Post-deployment |
| Reliability > Hallucination | 3.1 False or misleading information | AI system | Unintentional | Post-deployment |
| Reliability > Inconsistency | 7.3 Lack of capability or robustness | AI system | Unintentional | Post-deployment |
| Reliability > Miscalibration | 3.1 False or misleading information | AI system | Unintentional | Post-deployment |
| Reliability > Sycophancy | 3.1 False or misleading information | AI system | Intentional | Post-deployment |