Harms of Representation and Other Biases
Unequal treatment of individuals or groups by AI, often based on race, gender, or other sensitive characteristics, resulting in unfair outcomes and unfair representation of those groups.
"A pretrained LLM generally has many of the stereotypical biases commonly present in the human society (Touvron et al., 2023). This makes it difficult for users to trust that LLMs will work well for them and not produce unfair or biased responses. Appropriate finetuning can effectively limit the bias displayed in LLM outputs in a variety of situations, e.g. when models are explicitly prompted with stereotypes (Wang et al., 2023k), but it does not ‘solve’ the problem. Even after finetuning, biases often resurface when deliberately elicited (Wang et al., 2023k), or under novel scenarios, e.g. in writing reference letters (Wan et al., 2023a), generating synthetic training data (Yu et al., 2023c), screening resumes (Yin et al., 2024) or when used as LLM-agents (Pan et al., 2024)."(p. 90)
Supporting Evidence (1)
"The biases outputs are often much more prominent in low-resource languages (Yong et al., 2023) and in dialects used by marginalized groups (Hofmann et al., 2024). There is a need for research to develop better and more comprehensive tools for the detection of bias, toxicity (Wen et al., 2023; Wang and Chang, 2022), and other kinds of inappropriate behaviors. The current tools primarily focus on detecting content that is explicitly toxic and offensive, however, the issue of bias goes beyond commonly studied subjects of biases, such as race and gender (Hofmann et al., 2024). For instance, depending on the finetuning data, LLMs may also develop a bias towards particular political ideologies (Rutinowski et al., 2023). These biases are still relatively poorly understood, and there is a need for more extensive evaluations of current LLMs in novel scenarios to better understand the propensity of LLMs to enact such biases."(p. 90)
Other risks from Anwar et al. (2024) (26)
Agentic LLMs Pose Novel Risks (part of: 7.2 AI possessing dangerous capabilities)
Multi-Agent Safety Is Not Assured by Single-Agent Safety (part of: 7.6 Multi-agent risks)
Dual-Use Capabilities Enable Malicious Use and Misuse of LLMs (part of: 4.0 Malicious Actors & Misuse)
Corporate power may impede effective governance (part of: 6.1 Power centralization and unfair distribution of benefits)
Jailbreaks and Prompt Injections Threaten Security of LLMs (part of: 2.2 AI system security vulnerabilities and attacks)
Vulnerability to Poisoning and Backdoors (part of: 2.2 AI system security vulnerabilities and attacks)