Input validation, output filtering, and content moderation classifiers.
Many implementation solutions for guardrails have been developed.
OpenAI Moderation Endpoint [482] is an API released by OpenAI to check whether an LLM response complies with OpenAI's usage policy. The endpoint relies on a multi-label classifier that assigns the response to 11 categories, such as violence, sexuality, hate, and harassment; if the response falls into any of these categories, it is flagged as violating OpenAI's usage policy.
OpenChatKit Moderation Model [687] is fine-tuned from GPT-JT-6B on the OIG (Open Instruction Generalist) moderation dataset. This moderation model classifies user input into five categories: casual, possibly needs caution, needs caution, probably needs caution, and needs intervention. Responses are delivered only if the user input does not fall into the "needs intervention" category.
Llama Guard [330] employs a Llama-2-7B model as the guardrail, instruction-tuned on a red-teaming dataset. The guardrail outputs "safe" or "unsafe", both of which are single tokens in the SentencePiece tokenizer; if the assessment is "unsafe", the guardrail further outputs the policies that are violated.
NVIDIA NeMo [597] provides a programmable interface for users to establish their custom guardrails using Colang, a modeling language designed to specify dialogue flows and safety guardrails for conversational systems. Users provide a Colang script that defines dialogue flows and user and bot canonical forms, expressed in natural language. All these user-defined Colang elements are encoded and stored in a vector database. When NeMo receives a user input, it encodes the input as a vector and looks up its nearest neighbours among the stored vector-based user canonical forms. If NeMo finds a matching canonical form, the corresponding flow execution is activated, guiding the subsequent conversation.
Guardrails AI [8] is another framework that allows users to select and define their own guardrails.
It provides a collection of pre-built measures for specific types of risk (called "validators"), downloadable from Guardrails Hub. Users can choose multiple validators to intercept the inputs and outputs of LLMs, and they can also develop their own validators and contribute them to Guardrails Hub. Despite the development of various input and output modules designed to safeguard LLMs, these protections are typically insufficient to prevent harmful content [322, 638], particularly when challenged by rapidly evolving jailbreak attacks. This ineffectiveness is supported by theoretical research on guardrail systems [244], which posits the impossibility of fully censoring outputs. The limitation stems from "invertible string transformations": an attacker can apply an arbitrary transformation that eludes content filters and subsequently reverse it to recover the harmful content. Furthermore, integrating safeguard modules introduces extra computational overhead, increasing system processing time. In real-time applications, where speed and efficiency are crucial, developers may therefore face challenges in balancing safety and latency requirements.
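The validator pattern described above can be sketched in a few lines. This is a hypothetical minimal interface, assuming nothing about the actual Guardrails AI API: each validator checks one risk, and a guard function runs all of them over an LLM input or output before it is passed on.

```python
import re

class Validator:
    """Minimal validator interface; a hypothetical sketch, not the
    actual Guardrails AI API."""
    name = "base"

    def check(self, text):
        raise NotImplementedError

class NoEmailValidator(Validator):
    """Flags text containing an email address (a stand-in for a PII check)."""
    name = "no_email"
    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

    def check(self, text):
        return self.EMAIL.search(text) is None

class MaxLengthValidator(Validator):
    """Flags text longer than a configured limit."""
    name = "max_length"

    def __init__(self, limit=500):
        self.limit = limit

    def check(self, text):
        return len(text) <= self.limit

def guard(text, validators):
    """Run every validator over an LLM input or output; return whether
    the text passes and the names of any validators it failed."""
    failed = [v.name for v in validators if not v.check(text)]
    return (len(failed) == 0, failed)
```

A caller would intercept both the user prompt and the model response with `guard(...)`, which also illustrates the latency point above: every validator adds a pass over the text.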
Reasoning
Runtime input/output filtering classifiers block responses violating safety policies before delivery.
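Such a runtime filter can be sketched as a multi-label classifier gating delivery, in the spirit of the moderation endpoints discussed earlier. The keyword lists and category names here are toy stand-ins for a trained classifier.

```python
# Toy keyword-based multi-label "classifier"; a real system would use a
# trained moderation model. Categories and keywords are illustrative only.
CATEGORY_KEYWORDS = {
    "violence": {"attack", "weapon"},
    "harassment": {"insult", "threaten"},
}

def classify(text):
    """Return the set of policy categories the text appears to violate."""
    tokens = set(text.lower().split())
    return {cat for cat, kws in CATEGORY_KEYWORDS.items() if tokens & kws}

def deliver(response):
    """Block the response before delivery if any category is flagged."""
    flagged = classify(response)
    if flagged:
        return "Response withheld (policy categories: %s)" % ", ".join(sorted(flagged))
    return response
```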
Red Teaming
Red teaming is a critical defence mechanism to proactively discover vulnerabilities and risks in LLMs. This process provides developers with clues and insights into the weaknesses of LLMs, paving the way for the development of more advanced and secure models. Red teaming involves meticulously crafting adversarial prompts to simulate attacks and deliberately challenge the models. These prompts can be generated through manual methods, which rely on human expertise and creativity, or automatic methods, which leverage red LLMs to systematically explore the model’s weaknesses.
2.2.2 Testing & Evaluation > Red Teaming > Manual Red Teaming
Manual red-teaming approaches employ crowdworkers to annotate or handcraft adversarial test cases. The underlying methodology is to develop a human-and-model-in-the-loop system, where humans are tasked to adversarially converse with language models [50, 221, 362, 532, 710, 711, 769, 770]. Specifically, workers interact with language models through a dedicated user interface that allows them to observe model predictions and construct data that exposes model failures. This process may include multiple rounds in which the model is updated with the adversarial data collected thus far and redeployed; this encourages workers to craft increasingly challenging examples.
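The multi-round loop above can be sketched as follows. Everything here is a toy stand-in: `model` for the deployed LLM, `is_failure` for the worker's judgement, and `update` for retraining and redeployment.

```python
def red_team_rounds(model, probes, is_failure, update, rounds=2):
    """Human-and-model-in-the-loop sketch: each round, keep the probes
    that expose failures, update the model on them, and redeploy.
    All arguments are toy stand-ins for the real components."""
    collected = []
    for _ in range(rounds):
        # Workers probe the current model and keep inputs that cause failures.
        failures = [p for p in probes if is_failure(model(p))]
        collected.extend(failures)
        # "Retrain" on the adversarial data collected so far and redeploy.
        model = update(model, failures)
    return model, collected
```

In practice, later rounds face a harder target, which is what pushes workers toward increasingly challenging examples.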
2.2.2 Testing & Evaluation > Red Teaming > LLMs as Red Teamers
In the SL approach, red LLMs are fine-tuned to maximize the log-likelihood of failing zero-shot test cases. For RL, the models are initialized from the SL-trained models and then fine-tuned using synchronous advantage actor-critic (A2C) [505] to enhance the elicitation of harmful prompts.
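The SL objective amounts to minimizing the average negative log-likelihood of the failure-eliciting test cases under the red LM. A minimal sketch, where `red_lm_logprob` is a toy stand-in for scoring a prompt under the red model:

```python
def sl_red_team_loss(red_lm_logprob, failing_cases):
    """Supervised objective sketch: maximize the log-likelihood of test
    cases that elicited failures, i.e. minimize their average negative
    log-likelihood. `red_lm_logprob` is a toy stand-in for the red LM."""
    return -sum(red_lm_logprob(c) for c in failing_cases) / len(failing_cases)
```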
2.2.2 Testing & Evaluation > Safety Training
Safety training aims to enhance the safety and alignment of LLMs during their development.
1.1.2 Learning Objectives > Safety Training > Instruction Tuning
Safety training can be effectively implemented using adversarial prompts and their corresponding responsible output in an instruction-tuning framework.
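Concretely, this means constructing instruction–response pairs from adversarial prompts and responsible outputs and mixing them into the tuning set. A sketch, with illustrative field names not taken from any particular framework:

```python
import json

def build_safety_records(adversarial_prompts, safe_response):
    """Pair each adversarial prompt with a responsible response so the
    pairs can be mixed into an instruction-tuning dataset. Field names
    are illustrative, not from any specific framework."""
    records = [{"instruction": p, "output": safe_response} for p in adversarial_prompts]
    return "\n".join(json.dumps(r) for r in records)  # one JSONL line per pair
```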
1.1.2 Learning Objectives > Safety Training > Reinforcement Learning with Human Feedback
Reinforcement Learning with Human Feedback (RLHF) is a widely adopted strategy for aligning LLMs with human preferences, particularly concerning ethical values.
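At the heart of RLHF is a reward model trained on pairwise human preference data with a Bradley–Terry-style loss; a minimal sketch of that loss:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss commonly used to train the RLHF reward model:
    -log(sigmoid(r_chosen - r_rejected)). It is small when the model
    scores the human-preferred response higher than the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The trained reward model then provides the reward signal for the RL step (commonly PPO) that fine-tunes the LLM toward preferred behaviour.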
1.1.2 Learning Objectives
Trustworthy, Responsible, and Safe AI: A Comprehensive Architectural Framework for AI Safety with Challenges and Mitigations
Chen, Chen; Gong, Xueluan; Liu, Ziyao; Jiang, Weifeng; Goh, Si Qi; Lam, Kwok-Yan (2024)
AI Safety is an emerging area of critical importance to the safe adoption and deployment of AI systems. With the rapid proliferation of AI, and especially with the recent advancement of Generative AI (GAI), the technology ecosystem behind the design, development, adoption, and deployment of AI systems has drastically changed, broadening the scope of AI Safety to address impacts on public safety and national security. In this paper, we propose a novel architectural framework for understanding and analyzing AI Safety, defining its characteristics from three perspectives: Trustworthy AI, Responsible AI, and Safe AI. We provide an extensive review of current research and advancements in AI safety from these perspectives, highlighting their key challenges and mitigation approaches. Through examples from state-of-the-art technologies, particularly Large Language Models (LLMs), we present innovative mechanisms, methodologies, and techniques for designing and testing AI safety. Our goal is to promote advancement in AI safety research, and ultimately enhance people's trust in digital transformation.
Build and Use Model
Training, fine-tuning, and integrating the AI model
Developer
Entity that creates, trains, or modifies the AI system
Manage
Prioritising, responding to, and mitigating AI risks