Training-related (Adversarial examples)
Vulnerabilities in AI systems, software development toolchains, and hardware that can be exploited, resulting in unauthorized access, data and privacy breaches, or system manipulation that causes unsafe outputs or behavior.
"Adversarial examples [198, 83] refer to data that are designed to fool an AI model by inducing unintended behavior. They do this by exploiting spurious correlations learned by the model. They are part of inference-time attacks, where the examples are test examples. They generalize to different model architectures and models trained on different training sets."(p. 12)
Supporting Evidence (1)
"Unintended behavior can range from incorrect predictions with respect to the ground-truth prediction to outputs that are generally considered undesirable (e.g., toxic or harmful). For example, when an autonomous vehicle’s sensor sees a stop sign with an adversarial sticker, the vehicle’s AI system may misclassify the stop sign as an indicator for the vehicle to accelerate [72]."
Other risks from Gipiškis2024 (144)
Direct Harm Domains (content safety harms) > Violence and extremism (1.2 Exposure to toxic content)
Direct Harm Domains (content safety harms) > Hate and toxicity (1.2 Exposure to toxic content)
Direct Harm Domains (content safety harms) > Sexual content (1.2 Exposure to toxic content)
Direct Harm Domains (content safety harms) > Child harm (1.2 Exposure to toxic content)
Direct Harm Domains (content safety harms) > Self-harm (1.2 Exposure to toxic content)