This page is still being polished. If you have thoughts, please share them via the feedback form.
Data on this page is preliminary and may change. Please do not share or cite these figures publicly.
Input validation, output filtering, and content moderation classifiers.
Also in Non-Model
Different from the methods of designing defensive prompts to preprocess the input, the malicious prompt detection method aims to detect and filter out the harmful prompts through the input safeguard
Reasoning
Detects malicious prompts and filters them at input layer before model processing.
Keyword Matching
Keyword matching is a common technique for preventing prompt hacking [63]. The basic idea of the strategy is to check for words and phrases in the initial prompt that should be blocked. LLM developers can use a blocklist (i.e., a list of words and phrases to be blocked) or an allowlist (i.e., a list of words and phrases to be allowed) to defend undesired prompts [216]–[222]. These defense mechanisms monitor the input, detecting elements that could break ethical guidelines. These guidelines cover various content types, such as sensitive information, offensive language, or hate speech.
1.2.1 Guardrails & FilteringContent Classifier
Training a classifier to detect and refuse malicious prompts is a promising approach. For example, NeMo-Guardrails [223] is an open-source toolkit developed by Nvidia to enhance LLMs with programmable guardrails. When presented with an input prompt, the jailbreak guardrail employs the Guardrails’ “input rails” to assess whether the prompt violates the LLM usage policies. If the prompt is found to breach these policies, the guardrail will reject the question, ensuring a safe conversation scenario. Generally, the key behind a prompt classifier is to carefully design the input features of the classifier. Recently, the trajectory of latent predictions in LLMs has been demonstrated to be a useful feature for training a malicious prompt detector [224], [225].
1.2.1 Guardrails & FilteringMitigation in Input Modules
Mitigating the threat posed by the input module presents a significant challenge for LLM developers due to the diversity of the harmful inputs and adversarial prompts [209], [210].
1.2.1 Guardrails & FilteringMitigation in Input Modules > Defensive Prompt Design
Directly modifying the input prompts is a viable approach to steer the behavior of the model and foster the generation of responsible outputs. This method integrates contextual information or constraints in the prompts to provide background knowledge and guidelines while generating the output [22].
1.2.1 Guardrails & FilteringMitigation in Language Models
This section delves into mitigating risks associated with models, encompassing privacy preservation, detoxification and debiasing, mitigation of hallucinations, and defenses against model attacks.
1.1 ModelMitigation in Language Models > Privacy Preserving
Privacy leakage is a crucial risk of LLMs, since the powerful memorization and association capabilities of LLMs raise the risk of revealing private information within the training data. Researchers are devoted to designing privacypreserving frameworks in LLMs [226], [227], aiming to safeguard sensitive PII from possible disclosure during humanmachine conservation
1.1 ModelMitigation in Language Models > Detoxifying and Debiasing
To reduce the toxicity and bias of LLMs, prior efforts mainly focus on enhancing the quality of training data and conducting safety training.
1.1 ModelMitigation in Language Models > Hallucination Mitigation
Hallucinations, one of the key challenges associated with LLMs, have received extensive studies.
1.2.9 OtherRisk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems
Cui, Tianyu; Wang, Yanling; Fu, Chuanpu; Xiao, Yong; Li, Sijia; Deng, Xinhao; Liu, Yunpeng; Zhang, Qinglin; Qiu, Ziyi; Li, Peiyang; Tan, Zhixing; Xiong, Junwu; Kong, Xinyu; Wen, Zujie; Xu, Ke; Li, Qi (2024)
Despite their impressive capabilities, large lan- guage models (LLMs) have been observed to generate responses that include inaccurate or fabricated information, a phenomenon com- monly known as “hallucination”. In this work, we propose a simple Induce-then-Contrast De- coding (ICD) strategy to alleviate hallucina- tions. We first construct a factually weak LLM by inducing hallucinations from the original LLMs. Then, we penalize these induced hallu- cinations during decoding to enhance the fac- tuality of the generated content. Concretely, we determine the final next-token predictions by amplifying the predictions from the orig- inal model and downplaying the induced un- truthful predictions via contrastive decoding. Experimental results on both discrimination- based and generation-based hallucination eval- uation benchmarks, such as TruthfulQA and FACTSCORE, demonstrate that our proposed ICD methods can effectively enhance the factu- ality of LLMs across various model sizes and families. For example, when equipped with ICD, Llama2-7B-Chat and Mistral-7B-Instruct achieve performance comparable to ChatGPT and GPT4 on TruthfulQA, respectively.
Operate and Monitor
Running, maintaining, and monitoring the AI system post-deployment
Deployer
Entity that integrates and deploys the AI system for end users
Manage
Prioritising, responding to, and mitigating AI risks
Primary
4 Malicious Actors & Misuse