Skip to main content
Home/Risks/Gipiškis2024/Misuse of interpretability techniques

Misuse of interpretability techniques

Sub-category
Risk Domain

Vulnerabilities that can be exploited in AI systems, software development toolchains, and hardware, resulting in unauthorized access, data and privacy breaches, or system manipulation causing unsafe outputs or behavior.

"Interpretability techniques, by enabling a better understanding of the model, could potentially be used for harmful purposes. For example, mechanistic inter- pretability could be used to identify neurons responsible for specific functions, and certain neurons that encode safety-related features may be modified to de- crease its activation or certain information may be censored [24]. Furthermore, interpretability techniques can be used to simulate a white-box attack scenario. In this case, knowing the internal workings of a model aids in the development of adversarial attacks [24]."(p. 24)

Part of Model Evaluations (Interpretability/Explainability)

Other risks from Gipiškis2024 (144)