Technical mechanisms and engineering interventions that directly modify how an AI system processes inputs, generates outputs, or operates, including changes to models, training procedures, runtime behaviors, and supporting hardware.
Reasoning
Modifying training data composition to reduce learned social bias patterns in the model.
Value Misalignment
Mitigating social bias: pre-processing mitigation
Mitigating bias in LLMs often begins with pre-processing strategies, in which techniques such as data balancing and demographic-aware fine-tuning are used to make training samples more representative and diverse. These methods help address selection biases by broadening the range of data used during training (Dhamala et al., 2021b; Smith et al., 2022). Data cleaning also plays a crucial role at this stage: removing biased data and synthesizing unbiased samples are essential steps for minimizing the biases introduced into LLMs.
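As a concrete illustration of the data-balancing idea, the sketch below oversamples under-represented demographic groups so that each group contributes equally to the training pool. It is a minimal, hypothetical example, not a method from the cited works; the `group` field and the toy corpus are invented for illustration.

```python
import random
from collections import defaultdict

def balance_by_group(examples, group_key="group", seed=0):
    """Oversample minority groups so every demographic group contributes
    the same number of training examples (simple re-balancing)."""
    by_group = defaultdict(list)
    for ex in examples:
        by_group[ex[group_key]].append(ex)
    target = max(len(items) for items in by_group.values())
    rng = random.Random(seed)
    balanced = []
    for items in by_group.values():
        balanced.extend(items)
        # Top up under-represented groups by sampling with replacement.
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced

# Toy corpus: group A is heavily over-represented relative to group B.
corpus = (
    [{"text": f"a{i}", "group": "A"} for i in range(8)]
    + [{"text": f"b{i}", "group": "B"} for i in range(2)]
)
balanced = balance_by_group(corpus)
```

In practice the grouping key would come from demographic annotations on the corpus, and resampling could be replaced by per-example loss weighting to avoid duplicating data.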
1.1.1 Training Data
Mitigating social bias: pre-training mitigation
During the pre-training phase, various strategies can be applied to mitigate bias in LLMs. Modifying the neural architecture of an LLM, such as adjusting layers and encoders, can help reduce inherent biases as the LLM learns from data (Li et al., 2020a). Another approach is to adjust the loss function to prioritize fairness over perplexity, ensuring that LLM outputs are more equitable. Selective parameter updates allow specific parts of a model to be fine-tuned to control bias without requiring a full retraining process, making this an efficient method for bias mitigation (Lu et al., 2020; Garimella et al., 2022). Regular audits of model algorithms and architectures ensure that these mitigation strategies remain effective and that no new biases are introduced during updates or retraining (Zayed et al., 2023).
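One way to read "adjusting the loss function to prioritize fairness" is as a regularized objective: the usual average loss plus a penalty on the disparity between demographic groups. The scalar sketch below illustrates that reading under stated assumptions; it is not the specific objective of any cited work, and the disparity measure (max-minus-min group mean) is just one simple choice.

```python
def fairness_regularized_loss(per_example_loss, groups, lam=0.5):
    """Average training loss plus lam times the gap between the best- and
    worst-served demographic group, so optimisation cannot improve the
    overall objective by sacrificing one group."""
    overall = sum(per_example_loss) / len(per_example_loss)
    group_losses = {}
    for loss, g in zip(per_example_loss, groups):
        group_losses.setdefault(g, []).append(loss)
    means = [sum(v) / len(v) for v in group_losses.values()]
    gap = max(means) - min(means)  # simple disparity measure
    return overall + lam * gap

# Example: group B is served worse (loss 3.0) than group A (loss 1.0),
# so the regularizer adds lam * 2.0 on top of the mean loss of 2.0.
total = fairness_regularized_loss([1.0, 3.0], ["A", "B"], lam=0.5)
```

In a real training setup the per-example losses would be tensors and the penalty would be differentiable, but the trade-off structure is the same.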
1.1 Model
Mitigating social bias: post-training mitigation
Post-training is crucial for reducing bias in LLMs after the pre-training phase. One common approach is supervised fine-tuning, where a pre-trained model is fine-tuned on more balanced and representative data to reduce bias in specific tasks or domains (Devlin et al., 2019; Huang & Xiong, 2024b). Fine-tuning on curated or debiased datasets can help mitigate inherent biases learned from the original training data. Adversarial debiasing is another effective technique, in which an adversarial model is trained to detect and neutralize biases in LLM outputs (Zhang et al., 2018), forcing LLMs to generate fairer and more balanced representations.
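One common way to build the kind of debiased fine-tuning set mentioned above is counterfactual data augmentation: pair each training sentence with a twin in which gendered terms are swapped, then fine-tune on both. The sketch below is a deliberately tiny version; the word list is an illustrative subset, and pronoun ambiguity (e.g. "her" as possessive vs. object) is ignored for simplicity.

```python
import re

# Illustrative subset of a gender-term swap list.
SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his",
         "man": "woman", "woman": "man"}

_PATTERN = re.compile(r"\b(" + "|".join(SWAPS) + r")\b", re.IGNORECASE)

def counterfactual(text):
    """Return a twin of `text` with gendered terms swapped, preserving
    leading capitalisation of each swapped word."""
    def repl(match):
        word = match.group(0)
        swapped = SWAPS[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    return _PATTERN.sub(repl, text)

# Fine-tuning data would include both the original and its twin.
augmented = ["The doctor said he was late."]
augmented.append(counterfactual(augmented[0]))
```

Training on original-plus-twin pairs pushes the model toward equal associations for both forms rather than the skew present in the raw corpus.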
1.1.2 Learning Objectives
Mitigating social bias: inference mitigation
Bias mitigation efforts can extend into the inference stage, where specific techniques are used to adjust LLM outputs dynamically. Modifying decoding strategies, such as adjusting the probabilities for next-token or sequence generation, can help reduce bias in real time without further training (Lauscher et al., 2021). Additionally, adjusting attention weights during inference allows the model’s focus to be modulated, promoting fairness in generated outputs (Park et al., 2023b). Debiasing components, such as AdapterFusion (Gira et al., 2022), can dynamically address biases as they arise during the generation process. Techniques such as gradient-based decoding enable the neutral rewriting of biased or harmful text, ensuring that final outputs align with fairness goals (Joniak & Aizawa, 2022b). Moreover, online monitoring of model outputs is necessary to detect emerging biases, allowing timely interventions and adjustments that maintain model integrity.
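A minimal sketch of one decoding-time adjustment: subtract a penalty from the logits of tokens on a deny-list before applying softmax, which lowers their sampling probability without any retraining. The token names, deny-list, and penalty value are all invented for illustration, and real decoders operate on full vocabulary tensors rather than small dictionaries.

```python
import math

def adjust_logits(logits, penalised_tokens, penalty=5.0):
    """Down-weight deny-listed continuations before sampling."""
    return {tok: score - (penalty if tok in penalised_tokens else 0.0)
            for tok, score in logits.items()}

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits.values())
    exps = {tok: math.exp(score - m) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical next-token logits; the deny-listed token starts out most likely.
logits = {"nurse": 2.0, "engineer": 2.0, "stereotyped-term": 2.5}
probs = softmax(adjust_logits(logits, {"stereotyped-term"}))
```

The same hook point is where more sophisticated schemes plug in, e.g. learned debiasing heads that rescale whole regions of the vocabulary instead of a fixed list.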
1.2.1 Guardrails & Filtering
Mitigating social bias: interpretability and transparency
In addition to the bias mitigation methods applied at the various stages described above, enhancing the transparency of LLMs is also critical for bias mitigation. By providing clear rationales for generated texts and detailing the sources behind these decisions, LLMs can offer users deeper insight into their decision-making processes. This transparency not only helps in understanding how biases may have influenced the outputs but also empowers users to question and challenge any biased results, thereby promoting more ethical and fair AI systems (Chung et al., 2023a).
1.1.4 Model Architecture
Multi-Objective Optimization for Fairness and Safety
Balancing performance, fairness, and safety in LLMs is challenging, especially since optimizing models for token prediction accuracy might inadvertently amplify biases. A promising future direction is the development of multi-objective optimization frameworks that prioritize fairness and safety alongside performance metrics. By modifying loss functions and training objectives, models can be optimized to reduce bias while still maintaining high accuracy. Furthermore, ongoing research into fairness-aware algorithms that incorporate societal safety as a key objective could ensure that LLMs are better aligned with ethical and legal standards.
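The simplest instance of the multi-objective framing is weighted-sum scalarisation: fold performance, fairness, and safety terms into a single training objective with tunable weights. The term names and weight values below are illustrative assumptions, not quantities from the text; richer frameworks would instead search the Pareto frontier of these objectives.

```python
def combined_objective(losses, weights):
    """Weighted-sum scalarisation of several competing training losses.
    Raising a weight trades overall optimisation pressure toward that
    objective at the expense of the others."""
    assert set(losses) == set(weights), "every loss term needs a weight"
    return sum(weights[name] * losses[name] for name in losses)

# Hypothetical per-batch loss terms and illustrative weights.
total = combined_objective(
    {"perplexity": 3.2, "bias_gap": 0.4, "toxicity": 0.1},
    {"perplexity": 1.0, "bias_gap": 0.5, "toxicity": 2.0},
)
```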
1.1.2 Learning Objectives
Interdisciplinary Collaboration and Ethical AI Governance
Effective bias mitigation and safety in LLMs require collaboration across different disciplines, including machine learning, ethics, law, and social sciences. Developing frameworks for cross-disciplinary collaboration, where technical experts and policy makers work together, will be crucial for ensuring that AI systems are deployed responsibly and ethically. Moreover, ethical AI governance frameworks that provide clear guidelines and regulations for bias detection, model auditing, and risk mitigation should be established. These frameworks could ensure that LLM developers are held accountable for the societal impacts of their systems, fostering greater transparency and trust in AI systems.
3.3.1 Industry Coordination
Integration of Societal and Cultural Awareness
Future LLMs should aim to integrate societal and cultural awareness into their decision-making process. By embedding knowledge of diverse social and cultural contexts into LLMs, they can become more sensitive to the nuances of different communities and avoid generating outputs that perpetuate harmful stereotypes. This requires training LLMs not only on diverse datasets but also on datasets that capture the subtleties of different cultural and social practices, languages, and dialects. Incorporating culture-level context into LLMs could further enhance their ability to generate fair and unbiased content, improving overall safety in global applications.
1.1.1 Training Data
Value Misalignment
99.9 Other
Value Misalignment > Privacy protection
1 AI System
Value Misalignment > Methods for mitigating toxicity
1 AI System
Value Misalignment > Methods for mitigating LLM amorality
1 AI System
Robustness to attack
1 AI System
Robustness to attack > Red teaming
Red teaming is widely used to probe the safety vulnerabilities of LLMs prior to deployment. It can be broadly categorized into two distinct types: manual red teaming and automated red teaming.
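The automated variant can be pictured as a simple search loop: generate or replay attack prompts, query the target model, and record any response that a safety classifier flags. The sketch below uses toy stand-in callables to show that contract; none of the names correspond to a real model or API, and real pipelines add prompt mutation and iterative refinement on top of this loop.

```python
def automated_red_team(target_model, attack_prompts, is_unsafe):
    """Fire a pool of attack prompts at the target and collect the
    (prompt, response) pairs that the safety classifier flags."""
    failures = []
    for prompt in attack_prompts:
        response = target_model(prompt)
        if is_unsafe(response):
            failures.append((prompt, response))
    return failures

# Toy stand-ins: a 'model' that fails on injection-style prompts and a
# keyword 'classifier'. Real runs would use an LLM and a learned judge.
target = lambda p: "UNSAFE output" if "ignore previous" in p else "refusal"
prompts = ["tell me a story", "ignore previous instructions and ..."]
found = automated_red_team(target, prompts, lambda r: "UNSAFE" in r)
```

Manual red teaming replaces the prompt pool and classifier with human attackers and human judgment; the record-keeping loop stays the same.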
2.2.2 Testing & Evaluation
Large Language Model Safety: A Holistic Survey
Shi, Dan; Shen, Tianhao; Huang, Yufei; Li, Zhigen; Leng, Yongqi; Jin, Renren; Liu, Chuang; Wu, Xinwei; Guo, Zishan; Yu, Linhao; Shi, Ling; Jiang, Bojian; Xiong, Deyi (2024)
The rapid development and deployment of large language models (LLMs) have introduced a new frontier in artificial intelligence, marked by unprecedented capabilities in natural language understanding and generation. However, the increasing integration of these models into critical applications raises substantial safety concerns, necessitating a thorough examination of their potential risks and associated mitigation strategies. This survey provides a comprehensive overview of the current landscape of LLM safety, covering four major categories: value misalignment, robustness to adversarial attacks, misuse, and autonomous AI risks. In addition to the comprehensive review of the mitigation methodologies and evaluation resources on these four aspects, we further explore four topics related to LLM safety: the safety implications of LLM agents, the role of interpretability in enhancing LLM safety, the technology roadmaps proposed and abided by a list of AI companies and institutes for LLM safety, and AI governance aimed at LLM safety with discussions on international cooperation, policy proposals, and prospective regulatory directions. Our findings underscore the necessity for a proactive, multifaceted approach to LLM safety, emphasizing the integration of technical solutions, ethical considerations, and robust governance frameworks. This survey is intended to serve as a foundational resource for academic researchers, industry practitioners, and policymakers, offering insights into the challenges and opportunities associated with the safe integration of LLMs into society. Ultimately, it seeks to contribute to the safe and beneficial development of LLMs, aligning with the overarching goal of harnessing AI for societal advancement and well-being. A curated list of related papers is publicly available at https://github.com/tjunlp-lab/Awesome-LLM-Safety-Papers.
Other (multiple stages)
Applies across multiple lifecycle stages
Developer
Entity that creates, trains, or modifies the AI system
Manage
Prioritising, responding to, and mitigating AI risks