This page is still being polished. If you have thoughts, please share them via the feedback form.
Data on this page is preliminary and may change. Please do not share or cite these figures publicly.
Input validation, output filtering, and content moderation classifiers.
Also in Non-Model
Defensive prompts are a straightforward approach to prevent harmful outputs from LLMs.
Early tactics in promptbased defenses involve manipulating the prompts to prevent specific types of attacks. For example, simple strategies such as post-prompting [574] and the sandwich defense [575] can effectively guard against goal-hijacking attacks. Some other methods [261, 269] attempt to parameterize the different components of the prompts and structure user input into formats, such as quotes or JSON. This structuring strategy provides indicators to LLMs to distinguish user inputs from instructions, thereby reducing the influence of adversarial inputs on the model’s behavior. Additionally, protective system prompts could be crafted to enhance the safety of instructions. For instance, LLaMA2 [690] incorporates safe and positive words like “responsible”, “respectful,” or “wise” in the system prompt to imbue the model with positive traits. Recent works have explored the use of emergent capabilities in LLMs, e.g., In-Context Learning (ICL) [81] and chain-of-thought (CoT) [739] reasoning, to develop defensive prompts. Inspired by the In-Context Attack (ICA) which employs harmful demonstrations to undermine LLMs, In-Context Defense (ICD) [741] prompt technique aims to enhance model resilience by integrating well-behaved demonstrations that refuse harmful responses. Another study [491] that uses the ICD framework considers the diversity of user input and the adaptability of demonstrations. This research introduces a retrieval-based method that dynamically retrieves from a collection of demonstrations with safe responses, making the defensive prompt more tailored and relevant to specific user input. Furthermore, the Intention Analysis (IA) strategy [824] employs a CoT-like method that decomposes the generation process into two stages: IA first prompts LLMs to identify the underlying intention of the user query and then uses this dialogue along with a pre-defined policy to guide LLMs to generate the final response. Despite these prompt-based defence approaches are not complete solutions and do not offer guarantees, they present a relatively efficient strategy to prevent LLM misbehavior.
Reasoning
Defensive prompts structure and filter inputs to prevent harmful model outputs.
Red Teaming
Red teaming is a critical defence mechanism to proactively discover vulnerabilities and risks in LLMs. This process provides developers with clues and insights into the weaknesses of LLMs, paving the way for the development of more advanced and secure models. Red teaming involves meticulously crafting adversarial prompts to simulate attacks and deliberately challenge the models. These prompts can be generated through manual methods, which rely on human expertise and creativity, or automatic methods, which leverage red LLMs to systematically explore the model’s weaknesses.
2.2.2 Testing & EvaluationRed Teaming > Manual Red Teaming
Manual red-teaming approaches refer to employing crowdworkers to annotate or handcraft adversarial test cases. The underlying methodology is to develop a human-and-model-in-the-loop system, where humans are tasked to adversarially converse with language models [50, 221, 362, 532, 710, 711, 769, 770]. Specifically, workers interact with language models through a dedicated user interface that allows them to observe model predictions and construct data that exposes model failures. This process may include multiple rounds where the model is updated with the adversarial data collected thus far and redeployed; this encourages workers to craft increasingly challenging examples.
2.2.2 Testing & EvaluationRed Teaming > LLMs as Red Teamers
In the SL approach, red LLMs are fine-tuned to maximize the log-likelihood of failing, zero-shot test cases. For RL, the models are initialized from the SL-trained models and then fine-tuned using the synchronous advantage actor-critic (A2C) [505] to enhance the elicitation of harmful prompts
2.2.2 Testing & EvaluationSafety Training
Safety training aims to enhance the safety and alignment of LLMs during their development.
1.1.2 Learning ObjectivesSafety Training > Instruction Tuning
Safety training can be effectively implemented using adversarial prompts and their corresponding responsible output in an instruction-tuning framework.
1.1.2 Learning ObjectivesSafety Training > Reinforcement Learning with Human Feedback
Reinforcement Learning with Human Feedback (RLHF) is a strategy widely adopted to align with human preferences, particularly concerning ethical values.
1.1.2 Learning ObjectivesTrustworthy, Responsible, and Safe AI: A Comprehensive Architectural Framework for AI Safety with Challenges and Mitigations
Chen, Chen; Gong, Xueluan; Liu, Ziyao; Jiang, Weifeng; Goh, Si Qi; Lam, Kwok-Yan (2024)
AI Safety is an emerging area of critical importance to the safe adoption and deployment of AI systems. With the rapid proliferation of AI and especially with the recent advancement of Generative AI (or GAI), the technology ecosystem behind the design, development, adoption, and deployment of AI systems has drastically changed, broadening the scope of AI Safety to address impacts on public safety and national security. In this paper, we propose a novel architectural framework for understanding and analyzing AI Safety; defining its characteristics from three perspectives: Trustworthy AI, Responsible AI, and Safe AI. We provide an extensive review of current research and advancements in AI safety from these perspectives, highlighting their key challenges and mitigation approaches. Through examples from state-of-the-art technologies, particularly Large Language Models (LLMs), we present innovative mechanism, methodologies, and techniques for designing and testing AI safety. Our goal is to promote advancement in AI safety research, and ultimately enhance people's trust in digital transformation.
Other (general)
General mitigation not specific to a single lifecycle stage
Unable to classify
Could not be classified to a specific actor type
Manage
Prioritising, responding to, and mitigating AI risks
Primary
4 Malicious Actors & Misuse