Toxicity and Bias Tendencies

Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems

Cui et al. (2024)

Source DOI

Sub-categories (2)

Toxic Training Data

"Following previous studies [96], [97], toxic data in LLMs is defined as rude, disrespectful, or unreasonable language that is opposite to a polite, positive, and healthy language environment, including hate speech, offensive utterance, profanities, and threats [91]."

1.2 Exposure to toxic content

AI systemUnintentionalPre-deployment

Biased Training Data

"Compared with the definition of toxicity, the definition of bias is more subjective and contextdependent. Based on previous work [97], [101], we describe the bias as disparities that could raise demographic differences among various groups, which may involve demographic word prevalence and stereotypical contents. Concretely, in massive corpora, the prevalence of different pronouns and identities could influence an LLM’s tendency about gender, nationality, race, religion, and culture [4]. For instance, the pronoun He is over-represented compared with the pronoun She in the training corpora, leading LLMs to learn less context about She and thus generate He with a higher probability [4], [102]. Furthermore, stereotypical bias [103] which refers to overgeneralized beliefs about a particular group of people, usually keeps incorrect values and is hidden in the large-scale benign contents. In effect, defining what should be regarded as a stereotype in the corpora is still an open problem."

1.1 Unfair discrimination and misrepresentation

AI systemUnintentionalPre-deployment