Data on this page is preliminary and may change. Please do not share or cite these figures publicly.

Safety Decoding

Chen (2024) | LLM classified

Mitigation Taxonomy

1 AI System

1.2 Non-Model

1.2.9 Other

Non-Model mitigations not clearly fitting above categories.

Also in Non-Model

1.2.1 Guardrails & Filtering
1.2.2 Runtime Environment
1.2.3 Monitoring & Detection
1.2.4 Security Infrastructure
1.2.5 Provenance & Watermarking

Definition (p. 42)

LLMs often employ the Transformer architecture [700], which performs inference in an auto-regressive manner [672]. Auto-regressive inference suffers from error propagation: if an error occurs while generating an early part of the sequence, it can affect all subsequent parts, with limited opportunities to revise it. This error propagation can lead to increasingly unsafe and misaligned outputs as the sequence progresses.

Additional Information (p. 42)

The Rewindable Auto-regressive INference (RAIN) [413] method addresses this issue by alternating between forward steps, self-evaluation steps, and backward steps. Specifically, in the forward step, RAIN selects the next token sets from the candidates generated in the previous iteration based on their safety scores and the levels of exploration. Subsequently, the same LLM is prompted to self-evaluate the current text, updating safety scores and visit counts for the future calculation of exploration scores. Finally, in the backward step, RAIN generates multiple candidate token sets to prepare for the next iteration.

Additionally, SUBMIX [242] addresses the need for privacy-preserving text generation by introducing an ensemble approach. This method involves fine-tuning multiple models on separate segments of a private dataset. The next-token distributions of these models are then mixed with that of a publicly pre-trained LM to predict tokens. This ensemble approach is based on the finding that mixing token distributions from specialized models with that of a generalist model reduces the risk of privacy leaks, as no single model directly processes the entire private dataset.
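The mixing of next-token distributions described for SUBMIX can be illustrated with a minimal sketch. This is not the SUBMIX implementation: the function names, the plain averaging of the private models, and the fixed mixing weight `lam` are all assumptions made for illustration, and the text above does not say how SUBMIX actually sets its weights.

```python
from typing import Dict, List

# A next-token distribution: token -> probability (hypothetical representation).
Distribution = Dict[str, float]


def average(dists: List[Distribution]) -> Distribution:
    """Average several next-token distributions over their joint vocabulary."""
    if not dists:
        raise ValueError("need at least one distribution")
    vocab = set().union(*dists)
    return {tok: sum(d.get(tok, 0.0) for d in dists) / len(dists) for tok in vocab}


def mix_with_public(
    private_dists: List[Distribution],
    public_dist: Distribution,
    lam: float,
) -> Distribution:
    """Blend the private ensemble with the public model's distribution.

    `lam` is an assumed fixed weight on the private ensemble; lam = 0 falls
    back to the public model alone.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    ensemble = average(private_dists)
    vocab = set(ensemble) | set(public_dist)
    return {
        tok: lam * ensemble.get(tok, 0.0) + (1.0 - lam) * public_dist.get(tok, 0.0)
        for tok in vocab
    }


# Two models fine-tuned on separate segments, plus one public model (toy numbers).
private = [{"a": 0.8, "b": 0.2}, {"a": 0.4, "b": 0.6}]
public = {"a": 0.5, "b": 0.5}
mixed = mix_with_public(private, public, lam=0.5)
```

Because each input is a probability distribution and the weights sum to one, the mixed result is again a distribution from which the next token can be sampled.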

LLM Classification Details

Reasoning

Decoding techniques filter token generation at runtime to ensure safe, aligned outputs.
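To make runtime filtering during decoding concrete, here is a toy sketch loosely inspired by the forward, self-evaluation, and backward steps described for RAIN above. It is not RAIN: it replaces safety scores, visit counts, and exploration terms with a boolean evaluator and plain depth-first backtracking. The names `rewindable_decode`, `propose`, and `is_safe` are hypothetical stand-ins for a model's candidate generator and its self-evaluation.

```python
from typing import Callable, List, Sequence


def rewindable_decode(
    propose: Callable[[List[str]], Sequence[str]],
    is_safe: Callable[[List[str]], bool],
    max_len: int,
    max_steps: int = 1000,
) -> List[str]:
    """Extend the text token by token, rewinding when the evaluator rejects it."""
    text: List[str] = []
    # untried[i] holds the candidates not yet tried at position i.
    untried: List[List[str]] = [list(propose(text))]
    steps = 0
    while len(text) < max_len and steps < max_steps:
        steps += 1
        if not untried[-1]:
            # Backward step: no candidates left here, so rewind one position.
            untried.pop()
            if not text:
                break  # every option at the first position was rejected
            text.pop()
            continue
        text.append(untried[-1].pop(0))  # forward step: try the next candidate
        if is_safe(text):  # evaluation step (assumed boolean check)
            untried.append(list(propose(text)))
        else:
            text.pop()  # discard the rejected token and try another
    return text


def toy_propose(text: List[str]) -> List[str]:
    # After "a" alone, the only continuation offered is the unsafe token "x".
    return ["x"] if text == ["a"] else ["a", "b"]


def toy_is_safe(text: List[str]) -> bool:
    return text[-1] != "x"


result = rewindable_decode(toy_propose, toy_is_safe, max_len=2)
```

In this toy run the decoder first commits to "a", finds that every continuation is rejected, rewinds, and restarts from "b", which is the kind of revision that plain left-to-right decoding does not allow.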

Code: 1.2.1 | Version: v0.5 | Classified: Jan 22, 2026

Other mitigations from Chen (2024) (19)

Red Teaming

Red teaming is a critical defence mechanism to proactively discover vulnerabilities and risks in LLMs. This process provides developers with clues and insights into the weaknesses of LLMs, paving the way for the development of more advanced and secure models. Red teaming involves meticulously crafting adversarial prompts to simulate attacks and deliberately challenge the models. These prompts can be generated through manual methods, which rely on human expertise and creativity, or automatic methods, which leverage red LLMs to systematically explore the model’s weaknesses.

2.2.2 Testing & Evaluation

Lifecycle: Verify and Validate | Actor: Developer | AIRM: Measure

Red Teaming > Manual Red Teaming

Manual red-teaming approaches refer to employing crowdworkers to annotate or handcraft adversarial test cases. The underlying methodology is to develop a human-and-model-in-the-loop system, where humans are tasked to adversarially converse with language models [50, 221, 362, 532, 710, 711, 769, 770]. Specifically, workers interact with language models through a dedicated user interface that allows them to observe model predictions and construct data that exposes model failures. This process may include multiple rounds where the model is updated with the adversarial data collected thus far and redeployed; this encourages workers to craft increasingly challenging examples.

2.2.2 Testing & Evaluation

Lifecycle: Verify and Validate | Actor: Developer | AIRM: Measure

Red Teaming > LLMs as Red Teamers

In the supervised learning (SL) approach, red LLMs are fine-tuned to maximize the log-likelihood of failing, zero-shot test cases. In the reinforcement learning (RL) approach, the models are initialized from the SL-trained models and then fine-tuned using synchronous advantage actor-critic (A2C) [505] to enhance the elicitation of harmful prompts.
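The supervised objective described above can be written out as a standard sequence log-likelihood. The notation is an assumption introduced here for illustration: \theta denotes the red LLM's parameters, \mathcal{D}_{\mathrm{fail}} the set of failing zero-shot test cases, and x_t the t-th token of test case x.

```latex
\max_{\theta} \; \mathcal{L}_{\mathrm{SL}}(\theta)
  = \sum_{x \in \mathcal{D}_{\mathrm{fail}}} \sum_{t=1}^{|x|}
    \log p_{\theta}\left(x_t \mid x_{<t}\right)
```

The RL stage then starts from the parameters that this objective produces.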

2.2.2 Testing & Evaluation

Lifecycle: Verify and Validate | Actor: Other | AIRM: Measure

Safety Training

Safety training aims to enhance the safety and alignment of LLMs during their development.

1.1.2 Learning Objectives

Lifecycle: Build and Use Model | Actor: Developer | AIRM: Manage

Safety Training > Instruction Tuning

Safety training can be effectively implemented by pairing adversarial prompts with their corresponding responsible outputs in an instruction-tuning framework.

1.1.2 Learning Objectives

Lifecycle: Build and Use Model | Actor: Developer | AIRM: Manage

Safety Training > Reinforcement Learning with Human Feedback

Reinforcement Learning with Human Feedback (RLHF) is a widely adopted strategy for aligning LLMs with human preferences, particularly concerning ethical values.

1.1.2 Learning Objectives

Lifecycle: Build and Use Model | Actor: Developer | AIRM: Manage

Source Document

Trustworthy, Responsible, and Safe AI: A Comprehensive Architectural Framework for AI Safety with Challenges and Mitigations

Chen, Chen; Gong, Xueluan; Liu, Ziyao; Jiang, Weifeng; Goh, Si Qi; Lam, Kwok-Yan (2024)

AI Safety is an emerging area of critical importance to the safe adoption and deployment of AI systems. With the rapid proliferation of AI and especially with the recent advancement of Generative AI (or GAI), the technology ecosystem behind the design, development, adoption, and deployment of AI systems has drastically changed, broadening the scope of AI Safety to address impacts on public safety and national security. In this paper, we propose a novel architectural framework for understanding and analyzing AI Safety; defining its characteristics from three perspectives: Trustworthy AI, Responsible AI, and Safe AI. We provide an extensive review of current research and advancements in AI safety from these perspectives, highlighting their key challenges and mitigation approaches. Through examples from state-of-the-art technologies, particularly Large Language Models (LLMs), we present innovative mechanisms, methodologies, and techniques for designing and testing AI safety. Our goal is to promote advancement in AI safety research, and ultimately enhance people's trust in digital transformation.

DOI: 10.48550/arXiv.2408.12935

Classification

AI Lifecycle Stage

Build and Use Model

Training, fine-tuning, and integrating the AI model

Responsible Actor

Developer

Entity that creates, trains, or modifies the AI system

NIST AI RMF Function

Manage

Prioritising, responding to, and mitigating AI risks

Risk Domains

Primary

7.1 AI pursuing its own goals in conflict with human goals or values

Other

7.3 Lack of capability or robustness