Robustness
The risk that AI systems fail to perform reliably or effectively under varying conditions, exposing them to errors and failures that can have significant consequences, especially in critical applications or domains that require moral reasoning.
Resilience against adversarial attacks and distribution shift (p. 8)
Sub-categories (4)
Prompt Attacks
Carefully controlled adversarial perturbation can flip a GPT model’s answer when used to classify text inputs. Furthermore, by twisting the prompting question in a certain way, one can solicit dangerous information that the model chose not to answer.
Mapped to: 2.2 AI system security vulnerabilities and attacks

Paradigm & Distribution Shifts
Knowledge bases that LLMs are trained on continue to shift... questions such as “who scored the most points in NBA history” or “who is the richest person in the world” might have answers that need to be updated over time, or even in real time.
Mapped to: 3.1 False or misleading information

Interventional Effect
Existing disparities in data among different user groups might create differentiated experiences when users interact with an algorithmic system (e.g., a recommendation system), which will further reinforce the bias.
Mapped to: 1.1 Unfair discrimination and misrepresentation

Poisoning Attacks
Poisoning attacks fool the model by manipulating the training data; they are usually performed on classification models.
Mapped to: 2.2 AI system security vulnerabilities and attacks

Other risks from Liu et al. (2024) (34)
Reliability: 3.1 False or misleading information
Reliability > Misinformation: 3.1 False or misleading information
Reliability > Hallucination: 3.1 False or misleading information
Reliability > Inconsistency: 7.3 Lack of capability or robustness
Reliability > Miscalibration: 3.1 False or misleading information
Reliability > Sycophancy: 3.1 False or misleading information
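The Poisoning Attacks sub-category above can be made concrete with a minimal sketch. This is a hypothetical toy example, not code from Liu et al. (2024): a nearest-centroid classifier is trained twice, once on clean data and once on data where an attacker has injected a few mislabeled points near the decision boundary. The injected points shift one class centroid, flipping the model’s prediction on a clean test input.

```python
# Toy label-flipping poisoning attack against a nearest-centroid classifier
# on 2-D points. Illustrative only; real poisoning attacks target far more
# complex training pipelines.

def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def train(data):
    """data: list of ((x, y), label) pairs. Returns per-class centroids."""
    by_label = {}
    for point, label in data:
        by_label.setdefault(label, []).append(point)
    return {label: centroid(pts) for label, pts in by_label.items()}

def predict(model, point):
    """Assign the label of the nearest class centroid."""
    def dist2(label):
        cx, cy = model[label]
        return (cx - point[0]) ** 2 + (cy - point[1]) ** 2
    return min(model, key=dist2)

# Clean training set: class "A" clusters near x=0.5, class "B" near x=4.5.
clean = [((0.0, 0.0), "A"), ((1.0, 0.0), "A"),
         ((4.0, 0.0), "B"), ((5.0, 0.0), "B")]

# Attacker injects points deep in "A" territory but labeled "B",
# dragging the "B" centroid toward the boundary.
poisoned = clean + [((2.0, 0.0), "B"), ((1.5, 0.0), "B")]

test_point = (2.0, 0.0)
print(predict(train(clean), test_point))     # "A"
print(predict(train(poisoned), test_point))  # "B": poisoning flipped it
```

The attacker never touches the model itself, only the training data, which is what distinguishes poisoning from the prompt attacks described earlier: those perturb inputs at inference time against a fixed model.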