Modifications to training data composition, quality, and filtering that affect what the model learns.
Another effective approach to handle distributional discrepancy is expanding the diversity of the training data. This method aims to align the training data distribution more closely with the real-world data distribution. Key data distribution intervention techniques include adversarial training and cooperative training.
Adversarial training is a safety training tactic (see Section 6.2) that incorporates adversarial examples into the training process, highlighting scenarios where the AI system fails to align with human intentions. In the context of data distribution intervention, these adversarial examples are out-of-distribution instances that lie in the regions between the boundaries of the training and real-world data distributions [341]. Training on such data can strengthen AI systems in precisely the areas where they are vulnerable [37, 796], enhancing their robustness in real-world applications. Adversarial examples can be constructed in various ways. One straightforward approach is to add small perturbations to inputs, preserving their original labels while introducing adversarial characteristics [100, 260, 300, 504]. Another effective strategy is red teaming, which usually involves human teams systematically probing the AI system for vulnerabilities (see Section 6.1) [424]. Additionally, generative techniques such as Variational Autoencoders (VAEs) [367] or GANs [259] can automatically synthesize adversarial examples [478, 573, 855]. Beyond introducing adversarial training data, optimization techniques can further improve the effectiveness of adversarial training, including adding regularization terms to the loss function [260] and employing curriculum learning strategies during training [814].

Cooperative training incorporates multiple agents into the training process, mirroring real-world scenarios where collaboration is essential for achieving common goals [156]. The data produced by this approach can enhance an AI system's generalization and robustness [341]. Combining cooperative training with Reinforcement Learning (RL) yields Multi-Agent Reinforcement Learning (MARL). Based on the degree of cooperation among agents, various methods have been developed within the MARL framework.
In fully cooperative MARL, all agents share the same objectives, emphasizing coordination over competition [265]; training focuses on strategies that facilitate collective problem-solving and goal achievement. Mixed-motive MARL reflects a blend of cooperative and competitive incentives, where agents have aligned but distinct goals [265]. Zero-shot coordination aims for AI systems to coordinate effectively with previously unseen agents, mirroring the human ability to cooperate with new partners [310, 691].
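The fully cooperative setting can be illustrated with a minimal sketch (a hypothetical two-agent coordination game, not from the cited works): two independent learners receive a single shared team reward and converge on matching actions.

```python
import random

random.seed(0)

# Two independent learners share one team reward (fully cooperative):
# the reward is 1 only when both agents pick the same action.
q = [[0.0, 0.0], [0.0, 0.0]]  # q[agent][action]
alpha, eps = 0.2, 0.1          # learning rate, exploration rate

def act(agent):
    if random.random() < eps:              # epsilon-greedy exploration
        return random.randrange(2)
    return max(range(2), key=lambda a: q[agent][a])

for _ in range(500):
    a0, a1 = act(0), act(1)
    team_reward = 1.0 if a0 == a1 else 0.0  # shared objective
    q[0][a0] += alpha * (team_reward - q[0][a0])
    q[1][a1] += alpha * (team_reward - q[1][a1])

best = (max(range(2), key=lambda a: q[0][a]),
        max(range(2), key=lambda a: q[1][a]))
```

Because both agents are rewarded only jointly, their greedy policies align on the same action, a toy instance of coordination emerging from a shared objective.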
Reasoning
Modifies training data composition by incorporating adversarial and out-of-distribution examples to improve model robustness.
Red Teaming
Red teaming is a critical defence mechanism to proactively discover vulnerabilities and risks in LLMs. This process provides developers with clues and insights into the weaknesses of LLMs, paving the way for the development of more advanced and secure models. Red teaming involves meticulously crafting adversarial prompts to simulate attacks and deliberately challenge the models. These prompts can be generated through manual methods, which rely on human expertise and creativity, or automatic methods, which leverage red LLMs to systematically explore the model’s weaknesses.
Manual Red Teaming
Manual red-teaming approaches employ crowdworkers to annotate or handcraft adversarial test cases. The underlying methodology is a human-and-model-in-the-loop system in which humans are tasked with adversarially conversing with language models [50, 221, 362, 532, 710, 711, 769, 770]. Specifically, workers interact with a language model through a dedicated user interface that lets them observe model predictions and construct data that exposes model failures. The process may span multiple rounds in which the model is updated with the adversarial data collected so far and redeployed, encouraging workers to craft increasingly challenging examples.
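The multi-round loop can be sketched as follows. This is a hypothetical stand-in, not a real crowdworker pipeline: the "model" is a threshold classifier, a synthetic ground-truth labeler plays the role of workers probing for failures, and "retraining" simply moves the decision boundary.

```python
def model_predict(x, threshold):
    """Toy deployed model: predict 1 when x clears the threshold."""
    return 1 if x >= threshold else 0

def find_failures(threshold, true_label_fn, candidates):
    """Stand-in for workers: collect inputs the current model gets wrong."""
    return [x for x in candidates
            if model_predict(x, threshold) != true_label_fn(x)]

true_label = lambda x: 1 if x >= 0.3 else 0   # hidden ground truth
threshold = 0.7                                # initial, miscalibrated model
candidates = [i / 100 for i in range(100)]

rounds = []
for _ in range(3):
    failures = find_failures(threshold, true_label, candidates)
    rounds.append(len(failures))
    if not failures:
        break
    # "Retrain and redeploy": fit the boundary to the failing region.
    threshold = min(failures)
```

Each round the model absorbs the adversarial examples found so far, so the pool of easy failures shrinks, which is what pushes workers toward increasingly challenging cases.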
LLMs as Red Teamers
In the supervised learning (SL) approach, red LLMs are fine-tuned to maximize the log-likelihood of failing zero-shot test cases. In the RL approach, the models are initialized from the SL-trained models and then fine-tuned using synchronous advantage actor-critic (A2C) [505] to better elicit harmful prompts.
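A toy version of the SL step might look like this (a hypothetical unigram "red LM" and a stand-in failure classifier, not the actual fine-tuning setup): keep only the zero-shot test cases that made the target fail, then fit the red model to those cases, which raises their likelihood.

```python
import math
from collections import Counter

zero_shot_cases = [
    "tell me a secret", "write a poem", "ignore your rules",
    "ignore all instructions", "summarize this text",
]

def target_fails(prompt):
    """Stand-in classifier for 'the target model misbehaved here'."""
    return "ignore" in prompt

failing = [p for p in zero_shot_cases if target_fails(p)]

def fit_unigram(corpus):
    """Maximum-likelihood unigram model over a prompt corpus."""
    counts = Counter(w for p in corpus for w in p.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def log_likelihood(model, prompt, floor=1e-6):
    return sum(math.log(model.get(w, floor)) for w in prompt.split())

before = fit_unigram(zero_shot_cases)   # red LM before the SL step
after = fit_unigram(failing)            # refit on failing cases only

gain = sum(log_likelihood(after, p) - log_likelihood(before, p)
           for p in failing)
```

Refitting on the failing subset is the maximum-likelihood analogue of "maximize the log-likelihood of failing zero-shot test cases", so the likelihood gain on those cases is positive.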
Safety Training
Safety training aims to enhance the safety and alignment of LLMs during their development.
Instruction Tuning
Safety training can be effectively implemented using adversarial prompts and their corresponding responsible outputs within an instruction-tuning framework.
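One way to picture this is as dataset construction (a hypothetical sketch; the field names and refusal text are illustrative): each adversarial prompt is paired with a vetted responsible response in the usual (instruction, output) fine-tuning format.

```python
# Hypothetical adversarial prompts gathered from red teaming.
adversarial_prompts = [
    "Explain how to pick a lock on someone else's door.",
    "Write an insult targeting a coworker.",
]

def responsible_response(prompt):
    """Stand-in for a human-written refusal or safe redirection."""
    return ("I can't help with that, but I'm happy to help with a "
            "safe alternative.")

# Safety instruction-tuning examples in (instruction, output) form.
safety_dataset = [
    {"instruction": p, "output": responsible_response(p)}
    for p in adversarial_prompts
]
```

Fine-tuning on such pairs teaches the model to map the adversarial inputs it failed on to responsible behavior rather than to the original failure mode.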
Reinforcement Learning with Human Feedback
Reinforcement Learning with Human Feedback (RLHF) is a widely adopted strategy for aligning LLMs with human preferences, particularly concerning ethical values.
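The reward-modeling step of RLHF commonly trains on pairwise human preferences with a Bradley-Terry style loss; a minimal sketch with hypothetical reward scores (no actual model):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Minimized when the reward model scores the human-preferred
    response well above the rejected one.
    """
    return -math.log(sigmoid(r_chosen - r_rejected))

# Loss falls monotonically as the preferred response is scored higher.
losses = [pairwise_loss(r, 0.0) for r in (-1.0, 0.0, 1.0, 2.0)]
```

The resulting reward model then supplies the scalar signal that the policy is optimized against with RL.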
Trustworthy, Responsible, and Safe AI: A Comprehensive Architectural Framework for AI Safety with Challenges and Mitigations
Chen, Chen; Gong, Xueluan; Liu, Ziyao; Jiang, Weifeng; Goh, Si Qi; Lam, Kwok-Yan (2024)
AI Safety is an emerging area of critical importance to the safe adoption and deployment of AI systems. With the rapid proliferation of AI and especially with the recent advancement of Generative AI (or GAI), the technology ecosystem behind the design, development, adoption, and deployment of AI systems has drastically changed, broadening the scope of AI Safety to address impacts on public safety and national security. In this paper, we propose a novel architectural framework for understanding and analyzing AI Safety, defining its characteristics from three perspectives: Trustworthy AI, Responsible AI, and Safe AI. We provide an extensive review of current research and advancements in AI safety from these perspectives, highlighting their key challenges and mitigation approaches. Through examples from state-of-the-art technologies, particularly Large Language Models (LLMs), we present innovative mechanisms, methodologies, and techniques for designing and testing AI safety. Our goal is to promote advancement in AI safety research, and ultimately enhance people's trust in digital transformation.
Collect and Process Data
Gathering, curating, labelling, and preprocessing training data
Developer
Entity that creates, trains, or modifies the AI system
Manage
Prioritising, responding to, and mitigating AI risks