This page is still being polished. If you have thoughts, please share them via the feedback form.
Data on this page is preliminary and may change. Please do not share or cite these figures publicly.
Output attribution, content watermarking, and AI detection mechanisms.
Also in Non-Model
With the assistance of LLMs, we can obtain LLM-generated texts that resemble human writing. Adding watermarks to these texts could be an effective way to avoid the abuse issue. Watermarking offers promising potential for ownership verification mechanisms for effective government compliance management in the LLM-generated content era. Concretely, watermarks are visible or hidden identifiers [384].
For example, when interacting with an LLM system, the output text may include specific prefixes, such as “As an artificial intelligence assistant, ...”, to indicate that the text is generated by an LLM system. However, these visible watermarks are easy to remove. Therefore, watermarks are embedded as hidden patterns in texts that are imperceptible to humans [385]–[391]. For instance, watermarks can be integrated by substituting select words with their synonyms or making nuanced adjustments to the vertical positioning of text lines without altering the semantics of the original text [389]. A representative method involves using the hash value of preceding tokens to generate a random seed [385]. This seed is then employed to divide tokens into two categories: a “green” list and a “red” list. This process encourages a watermarked LLM to preferentially sample tokens from the “green” list, continuing this selection process until a complete sentence is embedded. However, this method has recently been demonstrated to have limitations [392], because it is easy for an attacker to break watermarking mechanisms [393]. To address these challenges, a unified formulation of statistical watermarking based on hypothesis testing [394], explores the trade-off between error types and achieving near-optimal rates in the i.i.d. setting. By establishing a theoretical foundation for existing and future statistical watermarking, it offers a unified and systematic approach to evaluating the statistical guarantees of both existing and future watermarking methodologies. Furthermore, the accomplishments of blockchain in copyright are introduced [395], utilizing blockchain to enhance LLMgenerated content reliability through a secure and transparent verification mechanism.
Reasoning
Watermarks embedded in outputs enable attribution and detection of AI-generated content.
Mitigation in Input Modules
Mitigating the threat posed by the input module presents a significant challenge for LLM developers due to the diversity of the harmful inputs and adversarial prompts [209], [210].
1.2.1 Guardrails & FilteringMitigation in Input Modules > Defensive Prompt Design
Directly modifying the input prompts is a viable approach to steer the behavior of the model and foster the generation of responsible outputs. This method integrates contextual information or constraints in the prompts to provide background knowledge and guidelines while generating the output [22].
1.2.1 Guardrails & FilteringMitigation in Input Modules > Malicious Prompt Detection
Different from the methods of designing defensive prompts to preprocess the input, the malicious prompt detection method aims to detect and filter out the harmful prompts through the input safeguard
1.2.1 Guardrails & FilteringMitigation in Language Models
This section delves into mitigating risks associated with models, encompassing privacy preservation, detoxification and debiasing, mitigation of hallucinations, and defenses against model attacks.
1.1 ModelMitigation in Language Models > Privacy Preserving
Privacy leakage is a crucial risk of LLMs, since the powerful memorization and association capabilities of LLMs raise the risk of revealing private information within the training data. Researchers are devoted to designing privacypreserving frameworks in LLMs [226], [227], aiming to safeguard sensitive PII from possible disclosure during humanmachine conservation
1.1 ModelMitigation in Language Models > Detoxifying and Debiasing
To reduce the toxicity and bias of LLMs, prior efforts mainly focus on enhancing the quality of training data and conducting safety training.
1.1 ModelRisk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems
Cui, Tianyu; Wang, Yanling; Fu, Chuanpu; Xiao, Yong; Li, Sijia; Deng, Xinhao; Liu, Yunpeng; Zhang, Qinglin; Qiu, Ziyi; Li, Peiyang; Tan, Zhixing; Xiong, Junwu; Kong, Xinyu; Wen, Zujie; Xu, Ke; Li, Qi (2024)
Despite their impressive capabilities, large lan- guage models (LLMs) have been observed to generate responses that include inaccurate or fabricated information, a phenomenon com- monly known as “hallucination”. In this work, we propose a simple Induce-then-Contrast De- coding (ICD) strategy to alleviate hallucina- tions. We first construct a factually weak LLM by inducing hallucinations from the original LLMs. Then, we penalize these induced hallu- cinations during decoding to enhance the fac- tuality of the generated content. Concretely, we determine the final next-token predictions by amplifying the predictions from the orig- inal model and downplaying the induced un- truthful predictions via contrastive decoding. Experimental results on both discrimination- based and generation-based hallucination eval- uation benchmarks, such as TruthfulQA and FACTSCORE, demonstrate that our proposed ICD methods can effectively enhance the factu- ality of LLMs across various model sizes and families. For example, when equipped with ICD, Llama2-7B-Chat and Mistral-7B-Instruct achieve performance comparable to ChatGPT and GPT4 on TruthfulQA, respectively.
Build and Use Model
Training, fine-tuning, and integrating the AI model
Developer
Entity that creates, trains, or modifies the AI system
Manage
Prioritising, responding to, and mitigating AI risks