Skip to main content
Home/Risks/Gipiškis2024/Adversarial attacks targeting explainable AI techniques

Adversarial attacks targeting explainable AI techniques

Sub-category
Risk Domain

Vulnerabilities that can be exploited in AI systems, software development toolchains, and hardware, resulting in unauthorized access, data and privacy breaches, or system manipulation causing unsafe outputs or behavior.

"Adversarial attacks can affect not only the model’s output but also its corresponding explanation. Current adversarial optimization techniques can intro- duce imperceptible noise to the input image, so that the model’s output does not change but the corresponding explanation is arbitrarily manipulated [61]. Such manipulations are harder to notice, as they are less commonly known compared to standard adversarial attacks targeting the model’s output."(p. 24)

Part of Model Evaluations (Interpretability/Explainability)

Other risks from Gipiškis2024 (144)