Model Evaluations (Interpretability/Explainability)
Challenges in understanding or explaining the decision-making processes of AI systems, which can lead to mistrust, difficulty in enforcing compliance standards or holding relevant actors accountable for harms, and the inability to identify and correct errors.
Sub-categories (5)
Misuse of interpretability techniques
"Interpretability techniques, by enabling a better understanding of the model, could potentially be used for harmful purposes. For example, mechanistic inter- pretability could be used to identify neurons responsible for specific functions, and certain neurons that encode safety-related features may be modified to de- crease its activation or certain information may be censored [24]. Furthermore, interpretability techniques can be used to simulate a white-box attack scenario. In this case, knowing the internal workings of a model aids in the development of adversarial attacks [24]."
2.2 AI system security vulnerabilities and attacks
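To make the misuse scenario above concrete, here is a minimal sketch (not the method of [24]; the model, layer, and unit index are all hypothetical) of how a PyTorch forward hook could suppress an activation that mechanistic interpretability has linked to a safety-related feature:

```python
# Hypothetical example: suppressing a "safety neuron" identified via interpretability.
import torch
import torch.nn as nn

# Stand-in for a model whose internals an attacker can inspect.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 2),
)

SAFETY_UNIT = 7  # hypothetical index of a unit that encodes a safety-related feature

def ablate_safety_unit(module, inputs, output):
    """Forward hook that zeroes the targeted unit's activation."""
    output = output.clone()
    output[:, SAFETY_UNIT] = 0.0
    return output

# Attach the hook to the hidden layer; subsequent forward passes run with the
# safety-related activation suppressed.
handle = model[1].register_forward_hook(ablate_safety_unit)
logits = model(torch.randn(4, 16))
handle.remove()
```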
"Adversarial attacks can affect not only the model’s output but also its corresponding explanation. Current adversarial optimization techniques can intro- duce imperceptible noise to the input image, so that the model’s output does not change but the corresponding explanation is arbitrarily manipulated [61]. Such manipulations are harder to notice, as they are less commonly known compared to standard adversarial attacks targeting the model’s output."
2.2 AI system security vulnerabilities and attacks
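A rough sketch of the explanation manipulation described above, assuming a differentiable PyTorch classifier and an input-gradient (saliency) explanation; the architecture, loss weights, and step counts are illustrative assumptions, not taken from [61]:

```python
import torch
import torch.nn as nn

def saliency(model, x):
    """Input-gradient explanation of the predicted class score."""
    score = model(x).max(dim=1).values.sum()
    grad, = torch.autograd.grad(score, x, create_graph=True)
    return grad

def manipulate_explanation(model, x, target_map, steps=200, eps=0.03, lam=10.0):
    """Find a small perturbation that keeps the logits stable but
    pushes the saliency map toward an attacker-chosen target."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=1e-2)
    clean_logits = model(x).detach()
    for _ in range(steps):
        x_adv = x + delta
        expl_loss = (saliency(model, x_adv) - target_map).pow(2).mean()
        out_loss = (model(x_adv) - clean_logits).pow(2).mean()
        loss = expl_loss + lam * out_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)  # keep the perturbation imperceptible
    return (x + delta).detach()

# Toy usage: a smooth stand-in classifier on flattened 8x8 "images".
model = nn.Sequential(nn.Linear(64, 32), nn.Softplus(), nn.Linear(32, 10))
x = torch.randn(1, 64)
target_map = torch.zeros(1, 64)
target_map[0, :8] = 1.0  # attacker-chosen explanation pattern
x_adv = manipulate_explanation(model, x, target_map)
```

A smooth activation such as Softplus is used in the stand-in model because optimizing the explanation requires nonzero second derivatives of the model with respect to its input.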
"Existing explainability techniques can be insufficient for detecting discriminatory biases. Manipulation methods can hide underlying biases from these tech- niques, generating misleading explanations [192, 112]. Such explanations ex- clude sensitive or prohibitive attributes, such as race or gender, and instead include desired attributes, even though they do not accurately represent the underlying model."
1.1 Unfair discrimination and misrepresentation
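The scaffolding idea behind the manipulation methods described above can be sketched as follows; all names are hypothetical, and the crude on-manifold test stands in for the out-of-distribution detector used to recognize the perturbation samples that explainers such as LIME or SHAP generate:

```python
import numpy as np

class AdversarialScaffold:
    """Routes realistic inputs to a biased model but explainer-generated
    perturbations to an innocuous model, hiding the bias from the explanation."""

    def __init__(self, biased_model, innocuous_model, looks_real):
        self.biased_model = biased_model        # relies on a sensitive attribute
        self.innocuous_model = innocuous_model  # relies only on harmless features
        self.looks_real = looks_real            # flags on-manifold inputs

    def predict(self, X):
        X = np.asarray(X)
        on_manifold = self.looks_real(X)
        return np.where(on_manifold,
                        self.biased_model(X),
                        self.innocuous_model(X))

# Toy usage: feature 0 is the sensitive attribute, feature 1 is harmless.
biased = lambda X: (X[:, 0] > 0).astype(int)
innocuous = lambda X: (X[:, 1] > 0).astype(int)
looks_real = lambda X: np.abs(X[:, 1]) < 3.0  # crude stand-in for an OOD detector
wrapped = AdversarialScaffold(biased, innocuous, looks_real)
```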
"Chain-of-thought reasoning is sometimes employed to get a better understanding of the model’s output, where it encourages transparent reasoning in text form. However, in some cases, this reasoning is not consistent with the final answer given by the AI model, and as such does not give sufficient transparency [113]."
7.4 Lack of transparency or interpretability
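One simple way to surface the inconsistency described above is a post-hoc check that compares the answer stated inside the chain of thought with the model's final answer; the "Answer: X" format and the regular expression below are assumptions made for illustration:

```python
import re
from typing import Optional

def extract_answer(text: str) -> Optional[str]:
    """Return the last 'Answer: X' value found in a block of text, if any."""
    matches = re.findall(r"[Aa]nswer\s*[:=]\s*([A-Da-d]|-?\d+(?:\.\d+)?)", text)
    return matches[-1].strip().upper() if matches else None

def cot_is_consistent(chain_of_thought: str, final_answer: str) -> bool:
    """Flag cases where the visible reasoning does not support the final output."""
    reasoned = extract_answer(chain_of_thought)
    stated = extract_answer(final_answer) or final_answer.strip().upper()
    return reasoned is not None and reasoned == stated
```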
"Models can employ steganography techniques to encode their intermediate rea- soning steps in ways that are not interpretable by humans [166]. Since en- coded reasoning can improve model performance, this tendency might naturally emerge and become more pronounced with more capable models."
7.2 AI possessing dangerous capabilities
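As a toy illustration of encoded reasoning (not an observed model behavior), hidden bits can ride on choices between interchangeable synonyms, so text that reads as ordinary prose still carries information a human reviewer would not recover:

```python
# Each synonym pair encodes one hidden bit of intermediate reasoning.
SYNONYM_PAIRS = [("therefore", "thus"), ("large", "big"), ("method", "approach")]

def encode_bits(bits):
    """Pick one word from each pair according to the hidden bit."""
    return [pair[bit] for pair, bit in zip(SYNONYM_PAIRS, bits)]

def decode_words(words):
    """Recover the hidden bits from the chosen synonyms."""
    return [pair.index(word) for pair, word in zip(SYNONYM_PAIRS, words)]

assert decode_words(encode_bits([1, 0, 1])) == [1, 0, 1]
```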
Other risks from Gipiškis 2024 (144)

Direct Harm Domains (content safety harms)
1.2 Exposure to toxic content

Direct Harm Domains (content safety harms) > Violence and extremism
1.2 Exposure to toxic content

Direct Harm Domains (content safety harms) > Hate and toxicity
1.2 Exposure to toxic content

Direct Harm Domains (content safety harms) > Sexual content
1.2 Exposure to toxic content

Direct Harm Domains (content safety harms) > Child harm
1.2 Exposure to toxic content

Direct Harm Domains (content safety harms) > Self-harm
1.2 Exposure to toxic content