Technical mechanisms and engineering interventions that directly modify how an AI system processes inputs, generates outputs, or operates, including changes to models, training procedures, runtime behaviors, and supporting hardware.
Reasoning
Modifying training data composition to reduce learned social bias patterns in the model.
Value Misalignment
Mitigating social bias: pre-processing mitigation
Mitigating bias in LLMs often begins with pre-processing strategies, in which techniques such as data balancing and demographic-aware fine-tuning are used to make training samples more representative and diverse. These methods help address selection biases by broadening the range of data used during training (Dhamala et al., 2021b; Smith et al., 2022). Data cleaning also plays a crucial role at this stage: removing biased data and synthesizing unbiased samples are essential steps for minimizing the biases introduced into LLMs.
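As a concrete illustration of the data-balancing idea, the sketch below oversamples under-represented demographic groups so that each group contributes equally to the training pool. It is a minimal, hypothetical example, not a method from the cited works; the `group` field and the toy corpus are invented for illustration.

```python
import random
from collections import defaultdict

def balance_by_group(examples, group_key="group", seed=0):
    """Oversample minority groups so every demographic group contributes
    the same number of training examples (simple re-balancing)."""
    by_group = defaultdict(list)
    for ex in examples:
        by_group[ex[group_key]].append(ex)
    target = max(len(items) for items in by_group.values())
    rng = random.Random(seed)
    balanced = []
    for items in by_group.values():
        balanced.extend(items)
        # Top up under-represented groups by sampling with replacement.
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced

# Toy corpus: group A is heavily over-represented relative to group B.
corpus = (
    [{"text": f"a{i}", "group": "A"} for i in range(8)]
    + [{"text": f"b{i}", "group": "B"} for i in range(2)]
)
balanced = balance_by_group(corpus)
```

In practice the grouping key would come from demographic annotations on the corpus, and resampling could be replaced by per-example loss weighting to avoid duplicating data.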
1.1.1 Training Data
Mitigating social bias: pre-training mitigation
During the pre-training phase, various strategies can be applied to mitigate bias in LLMs. Modifying the neural architecture of an LLM, such as adjusting layers and encoders, can help reduce inherent biases as the LLM learns from data (Li et al., 2020a). Another approach is to adjust the loss function to prioritize fairness over perplexity, ensuring that LLM outputs are more equitable. Selective parameter updates allow specific parts of a model to be fine-tuned to control bias without requiring a full retraining process, making this an efficient method for bias mitigation (Lu et al., 2020; Garimella et al., 2022). Regular audits of model algorithms and architectures ensure that these mitigation strategies remain effective and that no new biases are introduced during updates or retraining (Zayed et al., 2023).
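One way to read "adjusting the loss function to prioritize fairness" is as a regularized objective: the usual average loss plus a penalty on the disparity between demographic groups. The scalar sketch below illustrates that reading under stated assumptions; it is not the specific objective of any cited work, and the disparity measure (max-minus-min group mean) is just one simple choice.

```python
def fairness_regularized_loss(per_example_loss, groups, lam=0.5):
    """Average training loss plus lam times the gap between the best- and
    worst-served demographic group, so optimisation cannot improve the
    overall objective by sacrificing one group."""
    overall = sum(per_example_loss) / len(per_example_loss)
    group_losses = {}
    for loss, g in zip(per_example_loss, groups):
        group_losses.setdefault(g, []).append(loss)
    means = [sum(v) / len(v) for v in group_losses.values()]
    gap = max(means) - min(means)  # simple disparity measure
    return overall + lam * gap

# Example: group B is served worse (loss 3.0) than group A (loss 1.0),
# so the regularizer adds lam * 2.0 on top of the mean loss of 2.0.
total = fairness_regularized_loss([1.0, 3.0], ["A", "B"], lam=0.5)
```

In a real training setup the per-example losses would be tensors and the penalty would be differentiable, but the trade-off structure is the same.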
1.1 Model
Mitigating social bias: post-training mitigation
Post-training is crucial for reducing bias in LLMs after the pre-training phase. One common approach is supervised fine-tuning, where a pre-trained model is fine-tuned on more balanced and representative data to reduce bias in specific tasks or domains (Devlin et al., 2019; Huang & Xiong, 2024b). Fine-tuning on curated or debiased datasets can help mitigate inherent biases learned from the original training data. Adversarial debiasing is another effective technique, in which an adversarial model is trained to detect and neutralize biases in LLM outputs (Zhang et al., 2018), forcing LLMs to generate fairer and more balanced representations.
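One common way to build the kind of debiased fine-tuning set mentioned above is counterfactual data augmentation: pair each training sentence with a twin in which gendered terms are swapped, then fine-tune on both. The sketch below is a deliberately tiny version; the word list is an illustrative subset, and pronoun ambiguity (e.g. "her" as possessive vs. object) is ignored for simplicity.

```python
import re

# Illustrative subset of a gender-term swap list.
SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his",
         "man": "woman", "woman": "man"}

_PATTERN = re.compile(r"\b(" + "|".join(SWAPS) + r")\b", re.IGNORECASE)

def counterfactual(text):
    """Return a twin of `text` with gendered terms swapped, preserving
    leading capitalisation of each swapped word."""
    def repl(match):
        word = match.group(0)
        swapped = SWAPS[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    return _PATTERN.sub(repl, text)

# Fine-tuning data would include both the original and its twin.
augmented = ["The doctor said he was late."]
augmented.append(counterfactual(augmented[0]))
```

Training on original-plus-twin pairs pushes the model toward equal associations for both forms rather than the skew present in the raw corpus.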
1.1.2 Learning Objectives
Mitigating social bias: inference mitigation
Bias mitigation efforts can extend into the inference stage, where specific techniques are used to adjust LLM outputs dynamically. Modifying decoding strategies, such as adjusting the probabilities for next-token or sequence generation, can help reduce bias in real time without further training (Lauscher et al., 2021). Additionally, adjusting attention weights during inference allows the model’s focus to be modulated, promoting fairness in generated outputs (Park et al., 2023b). Debiasing components, such as AdapterFusion (Gira et al., 2022), can dynamically address biases as they arise during the generation process. Techniques such as gradient-based decoding enable the neutral rewriting of biased or harmful text, ensuring that final outputs align with fairness goals (Joniak & Aizawa, 2022b). Moreover, online monitoring of model outputs is necessary to detect emerging biases, allowing timely interventions and adjustments that maintain model integrity.
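A minimal sketch of one decoding-time adjustment: subtract a penalty from the logits of tokens on a deny-list before applying softmax, which lowers their sampling probability without any retraining. The token names, deny-list, and penalty value are all invented for illustration, and real decoders operate on full vocabulary tensors rather than small dictionaries.

```python
import math

def adjust_logits(logits, penalised_tokens, penalty=5.0):
    """Down-weight deny-listed continuations before sampling."""
    return {tok: score - (penalty if tok in penalised_tokens else 0.0)
            for tok, score in logits.items()}

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits.values())
    exps = {tok: math.exp(score - m) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical next-token logits; the deny-listed token starts out most likely.
logits = {"nurse": 2.0, "engineer": 2.0, "stereotyped-term": 2.5}
probs = softmax(adjust_logits(logits, {"stereotyped-term"}))
```

The same hook point is where more sophisticated schemes plug in, e.g. learned debiasing heads that rescale whole regions of the vocabulary instead of a fixed list.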
1.2.1 Guardrails & Filtering
Mitigating social bias: interpretability and transparency
In addition to the bias mitigation methods applied at the various stages described above, enhancing the transparency of LLMs is also critical for bias mitigation. By providing clear rationales for generated texts and detailing the sources behind these decisions, LLMs can offer users deeper insight into their decision-making processes. This transparency not only helps in understanding how biases may have influenced the outputs but also empowers users to question and challenge any biased results, thereby promoting more ethical and fair AI systems (Chung et al., 2023a).
1.1.4 Model Architecture
Multi-Objective Optimization for Fairness and Safety
Balancing performance, fairness, and safety in LLMs is challenging, especially since optimizing models for token prediction accuracy might inadvertently amplify biases. A promising future direction is the development of multi-objective optimization frameworks that prioritize fairness and safety alongside performance metrics. By modifying loss functions and training objectives, models can be optimized to reduce bias while still maintaining high accuracy. Furthermore, ongoing research into fairness-aware algorithms that incorporate societal safety as a key objective could ensure that LLMs are better aligned with ethical and legal standards.
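The simplest instance of the multi-objective framing is weighted-sum scalarisation: fold performance, fairness, and safety terms into a single training objective with tunable weights. The term names and weight values below are illustrative assumptions, not quantities from the text; richer frameworks would instead search the Pareto frontier of these objectives.

```python
def combined_objective(losses, weights):
    """Weighted-sum scalarisation of several competing training losses.
    Raising a weight trades overall optimisation pressure toward that
    objective at the expense of the others."""
    assert set(losses) == set(weights), "every loss term needs a weight"
    return sum(weights[name] * losses[name] for name in losses)

# Hypothetical per-batch loss terms and illustrative weights.
total = combined_objective(
    {"perplexity": 3.2, "bias_gap": 0.4, "toxicity": 0.1},
    {"perplexity": 1.0, "bias_gap": 0.5, "toxicity": 2.0},
)
```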
1.1.2 Learning Objectives
Interdisciplinary Collaboration and Ethical AI Governance
Effective bias mitigation and safety in LLMs require collaboration across different disciplines, including machine learning, ethics, law, and social sciences. Developing frameworks for cross-disciplinary collaboration, where technical experts and policy makers work together, will be crucial for ensuring that AI systems are deployed responsibly and ethically. Moreover, ethical AI governance frameworks that provide clear guidelines and regulations for bias detection, model auditing, and risk mitigation should be established. These frameworks could ensure that LLM developers are held accountable for the societal impacts of their systems, fostering greater transparency and trust in AI systems.
3.3.1 Industry Coordination
Integration of Societal and Cultural Awareness
Future LLMs should aim to integrate societal and cultural awareness into their decision-making process. By embedding knowledge of diverse social and cultural contexts into LLMs, they can become more sensitive to the nuances of different communities and avoid generating outputs that perpetuate harmful stereotypes. This requires training LLMs not only on diverse datasets but also on datasets that capture the subtleties of different cultural and social practices, languages, and dialects. Incorporating culture-level context into LLMs could further enhance their ability to generate fair and unbiased content, improving overall safety in global applications.
1.1.1 Training Data
Value Misalignment
99.9 Other
Value Misalignment > Privacy protection
1 AI System
Value Misalignment > Methods for mitigating toxicity
1 AI System
Value Misalignment > Methods for mitigating LLM amorality
1 AI System
Robustness to attack
1 AI System
Robustness to attack > Red teaming
Red teaming is widely used to probe the safety vulnerabilities of LLMs prior to deployment. It can be broadly categorized into two distinct types: manual red teaming and automated red teaming.
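The automated variant can be pictured as a simple search loop: generate or replay attack prompts, query the target model, and record any response that a safety classifier flags. The sketch below uses toy stand-in callables to show that contract; none of the names correspond to a real model or API, and real pipelines add prompt mutation and iterative refinement on top of this loop.

```python
def automated_red_team(target_model, attack_prompts, is_unsafe):
    """Fire a pool of attack prompts at the target and collect the
    (prompt, response) pairs that the safety classifier flags."""
    failures = []
    for prompt in attack_prompts:
        response = target_model(prompt)
        if is_unsafe(response):
            failures.append((prompt, response))
    return failures

# Toy stand-ins: a 'model' that fails on injection-style prompts and a
# keyword 'classifier'. Real runs would use an LLM and a learned judge.
target = lambda p: "UNSAFE output" if "ignore previous" in p else "refusal"
prompts = ["tell me a story", "ignore previous instructions and ..."]
found = automated_red_team(target, prompts, lambda r: "UNSAFE" in r)
```

Manual red teaming replaces the prompt pool and classifier with human attackers and human judgment; the record-keeping loop stays the same.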
2.2.2 Testing & Evaluation
Large Language Model Safety: A Holistic Survey
Shi, Dan; Shen, Tianhao; Huang, Yufei; Li, Zhigen; Leng, Yongqi; Jin, Renren; Liu, Chuang; Wu, Xinwei; Guo, Zishan; Yu, Linhao; Shi, Ling; Jiang, Bojian; Xiong, Deyi (2024)
The rapid development and deployment of large language models (LLMs) have introduced a new frontier in artificial intelligence, marked by unprecedented capabilities in natural language understanding and generation. However, the increasing integration of these models into critical applications raises substantial safety concerns, necessitating a thorough examination of their potential risks and associated mitigation strategies. This survey provides a comprehensive overview of the current landscape of LLM safety, covering four major categories: value misalignment, robustness to adversarial attacks, misuse, and autonomous AI risks. In addition to the comprehensive review of the mitigation methodologies and evaluation resources on these four aspects, we further explore four topics related to LLM safety: the safety implications of LLM agents, the role of interpretability in enhancing LLM safety, the technology roadmaps proposed and abided by a list of AI companies and institutes for LLM safety, and AI governance aimed at LLM safety with discussions on international cooperation, policy proposals, and prospective regulatory directions. Our findings underscore the necessity for a proactive, multifaceted approach to LLM safety, emphasizing the integration of technical solutions, ethical considerations, and robust governance frameworks. This survey is intended to serve as a foundational resource for academic researchers, industry practitioners, and policymakers, offering insights into the challenges and opportunities associated with the safe integration of LLMs into society. Ultimately, it seeks to contribute to the safe and beneficial development of LLMs, aligning with the overarching goal of harnessing AI for societal advancement and well-being. A curated list of related papers is publicly available at https://github.com/tjunlp-lab/Awesome-LLM-Safety-Papers.
Other (multiple stages)
Applies across multiple lifecycle stages
Developer
Entity that creates, trains, or modifies the AI system
Manage
Prioritising, responding to, and mitigating AI risks