Model Evaluations (Interpretability/Explainability)
Challenges in understanding or explaining the decision-making processes of AI systems, which can lead to mistrust, difficulty in enforcing compliance standards or holding relevant actors accountable for harms, and the inability to identify and correct errors.
Sub-categories (5)
Misuse of interpretability techniques
"Interpretability techniques, by enabling a better understanding of the model, could potentially be used for harmful purposes. For example, mechanistic inter- pretability could be used to identify neurons responsible for specific functions, and certain neurons that encode safety-related features may be modified to de- crease its activation or certain information may be censored [24]. Furthermore, interpretability techniques can be used to simulate a white-box attack scenario. In this case, knowing the internal workings of a model aids in the development of adversarial attacks [24]."
2.2 AI system security vulnerabilities and attacks
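To make the misuse scenario above concrete, here is a minimal sketch (not the method of [24]; the model, layer, and unit index are all hypothetical) of how a PyTorch forward hook could suppress an activation that mechanistic interpretability has linked to a safety-related feature:

```python
# Hypothetical example: suppressing a "safety neuron" identified via interpretability.
import torch
import torch.nn as nn

# Stand-in for a model whose internals an attacker can inspect.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 2),
)

SAFETY_UNIT = 7  # hypothetical index of a unit that encodes a safety-related feature

def ablate_safety_unit(module, inputs, output):
    """Forward hook that zeroes the targeted unit's activation."""
    output = output.clone()
    output[:, SAFETY_UNIT] = 0.0
    return output

# Attach the hook to the hidden layer; subsequent forward passes run with the
# safety-related activation suppressed.
handle = model[1].register_forward_hook(ablate_safety_unit)
logits = model(torch.randn(4, 16))
handle.remove()
```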
"Adversarial attacks can affect not only the model’s output but also its corresponding explanation. Current adversarial optimization techniques can intro- duce imperceptible noise to the input image, so that the model’s output does not change but the corresponding explanation is arbitrarily manipulated [61]. Such manipulations are harder to notice, as they are less commonly known compared to standard adversarial attacks targeting the model’s output."
2.2 AI system security vulnerabilities and attacks
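A rough sketch of the explanation manipulation described above, assuming a differentiable PyTorch classifier and an input-gradient (saliency) explanation; the architecture, loss weights, and step counts are illustrative assumptions, not taken from [61]:

```python
import torch
import torch.nn as nn

def saliency(model, x):
    """Input-gradient explanation of the predicted class score."""
    score = model(x).max(dim=1).values.sum()
    grad, = torch.autograd.grad(score, x, create_graph=True)
    return grad

def manipulate_explanation(model, x, target_map, steps=200, eps=0.03, lam=10.0):
    """Find a small perturbation that keeps the logits stable but
    pushes the saliency map toward an attacker-chosen target."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=1e-2)
    clean_logits = model(x).detach()
    for _ in range(steps):
        x_adv = x + delta
        expl_loss = (saliency(model, x_adv) - target_map).pow(2).mean()
        out_loss = (model(x_adv) - clean_logits).pow(2).mean()
        loss = expl_loss + lam * out_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)  # keep the perturbation imperceptible
    return (x + delta).detach()

# Toy usage: a smooth stand-in classifier on flattened 8x8 "images".
model = nn.Sequential(nn.Linear(64, 32), nn.Softplus(), nn.Linear(32, 10))
x = torch.randn(1, 64)
target_map = torch.zeros(1, 64)
target_map[0, :8] = 1.0  # attacker-chosen explanation pattern
x_adv = manipulate_explanation(model, x, target_map)
```

A smooth activation such as Softplus is used in the stand-in model because optimizing the explanation requires nonzero second derivatives of the model with respect to its input.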
"Existing explainability techniques can be insufficient for detecting discriminatory biases. Manipulation methods can hide underlying biases from these tech- niques, generating misleading explanations [192, 112]. Such explanations ex- clude sensitive or prohibitive attributes, such as race or gender, and instead include desired attributes, even though they do not accurately represent the underlying model."
1.1 Unfair discrimination and misrepresentation
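The scaffolding idea behind the manipulation methods described above can be sketched as follows; all names are hypothetical, and the crude on-manifold test stands in for the out-of-distribution detector used to recognize the perturbation samples that explainers such as LIME or SHAP generate:

```python
import numpy as np

class AdversarialScaffold:
    """Routes realistic inputs to a biased model but explainer-generated
    perturbations to an innocuous model, hiding the bias from the explanation."""

    def __init__(self, biased_model, innocuous_model, looks_real):
        self.biased_model = biased_model        # relies on a sensitive attribute
        self.innocuous_model = innocuous_model  # relies only on harmless features
        self.looks_real = looks_real            # flags on-manifold inputs

    def predict(self, X):
        X = np.asarray(X)
        on_manifold = self.looks_real(X)
        return np.where(on_manifold,
                        self.biased_model(X),
                        self.innocuous_model(X))

# Toy usage: feature 0 is the sensitive attribute, feature 1 is harmless.
biased = lambda X: (X[:, 0] > 0).astype(int)
innocuous = lambda X: (X[:, 1] > 0).astype(int)
looks_real = lambda X: np.abs(X[:, 1]) < 3.0  # crude stand-in for an OOD detector
wrapped = AdversarialScaffold(biased, innocuous, looks_real)
```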
"Chain-of-thought reasoning is sometimes employed to get a better understanding of the model’s output, where it encourages transparent reasoning in text form. However, in some cases, this reasoning is not consistent with the final answer given by the AI model, and as such does not give sufficient transparency [113]."
7.4 Lack of transparency or interpretability
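One simple way to surface the inconsistency described above is a post-hoc check that compares the answer stated inside the chain of thought with the model's final answer; the "Answer: X" format and the regular expression below are assumptions made for illustration:

```python
import re
from typing import Optional

def extract_answer(text: str) -> Optional[str]:
    """Return the last 'Answer: X' value found in a block of text, if any."""
    matches = re.findall(r"[Aa]nswer\s*[:=]\s*([A-Da-d]|-?\d+(?:\.\d+)?)", text)
    return matches[-1].strip().upper() if matches else None

def cot_is_consistent(chain_of_thought: str, final_answer: str) -> bool:
    """Flag cases where the visible reasoning does not support the final output."""
    reasoned = extract_answer(chain_of_thought)
    stated = extract_answer(final_answer) or final_answer.strip().upper()
    return reasoned is not None and reasoned == stated
```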
"Models can employ steganography techniques to encode their intermediate rea- soning steps in ways that are not interpretable by humans [166]. Since en- coded reasoning can improve model performance, this tendency might naturally emerge and become more pronounced with more capable models."
7.2 AI possessing dangerous capabilities
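As a toy illustration of encoded reasoning (not an observed model behavior), hidden bits can ride on choices between interchangeable synonyms, so text that reads as ordinary prose still carries information a human reviewer would not recover:

```python
# Each synonym pair encodes one hidden bit of intermediate reasoning.
SYNONYM_PAIRS = [("therefore", "thus"), ("large", "big"), ("method", "approach")]

def encode_bits(bits):
    """Pick one word from each pair according to the hidden bit."""
    return [pair[bit] for pair, bit in zip(SYNONYM_PAIRS, bits)]

def decode_words(words):
    """Recover the hidden bits from the chosen synonyms."""
    return [pair.index(word) for pair, word in zip(SYNONYM_PAIRS, words)]

assert decode_words(encode_bits([1, 0, 1])) == [1, 0, 1]
```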
Other risks from Gipiškis 2024 (144)

Direct Harm Domains (content safety harms)
1.2 Exposure to toxic content

Direct Harm Domains (content safety harms) > Violence and extremism
1.2 Exposure to toxic content

Direct Harm Domains (content safety harms) > Hate and toxicity
1.2 Exposure to toxic content

Direct Harm Domains (content safety harms) > Sexual content
1.2 Exposure to toxic content

Direct Harm Domains (content safety harms) > Child harm
1.2 Exposure to toxic content

Direct Harm Domains (content safety harms) > Self-harm
1.2 Exposure to toxic content