This page is still being polished. If you have thoughts, please share them via the feedback form.
Data on this page is preliminary and may change. Please do not share or cite these figures publicly.
Input validation, output filtering, and content moderation classifiers.
Also in Non-Model
Reasoning
Filters content in real-time before or after AI system processing.
Curated PJ.
Curated dataset of predatory chats from Perverted Justice.
1.1.1 Training DataHaidra Horde Safety
Uses CLIP+BLIP for safety features for the AI Horde.
1.2.1 Guardrails & FilteringPDQ:
Open source hashing algorithm to detect known CSAM
1.2.5 Provenance & WatermarkingPhotoDNA:
Hashing algorithm to detect known CSAM
1.2.5 Provenance & WatermarkingCybertip API:
Tool to report child exploitation to NCMEC
1.2.9 OtherIWF URL List:
Removes URLs related to CSAM
1.2.1 Guardrails & FilteringLlama-Guard
LLM-based input-output safeguard models geared towards Human-AI conversation use cases.
1.2.1 Guardrails & FilteringShieldGemma
A set of instruction tuned models for evaluating the safety of text prompt input and text output responses against a set of safety policies.
1.2.1 Guardrails & FilteringPerspective API
Classifiers which score content based on issues including toxicity, sexually explicit content, and threats. Can be used as a mitigation or evaluation tool.
1.2.1 Guardrails & FilteringPapillon filter
Uses LLM to create privacy preserving queries for the user.
1.2.1 Guardrails & FilteringPresidio
Provides fast identification and anonymization modules for private entities in text and images such as credit card numbers, names, locations etc
1.2.1 Guardrails & FilteringPII processing
Multilingual Named Entity Recognition and PII processor.
1.2.1 Guardrails & FilteringScrub
Provides multiple levels of scrubbing with ML to ensure optimal anonymization of sensitive information and safeguarding of user privacy
1.2.1 Guardrails & FilteringOctopii
uses OCR, regex lists NLP to search public-facing locations or Government ID, addresses, emails etc in images, PDFs and documents.
1.2.1 Guardrails & FilteringGitleaks
tool for detecting secrets like passwords, API keys, and tokens in git repos.
1.2.3 Monitoring & DetectionPrompt Guard
A classifier model trained on a large corpus of aacks capable of detecting both explicitly malicious prompts and prompts that contain injected inputs.
1.2.1 Guardrails & FilteringPrompt-Guard-86M
Detects both explicitly malicious prompts as well as data that contains injected inputs.
1.2.1 Guardrails & Filteringprotectai/deberta-v3-base-prompt-injection-v2
Detects and classifies prompt injection aacks that can manipulate language models into producing unintended outputs.
1.2.1 Guardrails & Filteringdeepset/deberta-v3-base-injection
Prompt injection detection and classification Programmable guardrails: though they are not specific to prompt injection and jailbreaking, they can allow to filter the inputs and outputs against these aacks.
1.2.1 Guardrails & FilteringChild safety
Child Harm (including but not limited to grooming, minor sexualization, and illegal content such as Child Sexual Abuse Material, or CSAM). Note, this category is separated from other content safety issues given CSAM is illegal to possess, share, or distribute in many jurisdictions. It poses unique challenges for testing and mitigations implementation. 35
99 OtherContent safety
Content policies are specific to an AI system and its use cases. 37 A common example is the MLCommons AILuminate taxonomy. 38 Categories of content safety include but are not limited to: Violent Crimes, Non-violent Crimes, Sex-related Crimes, Child Sexual Exploitation, Indiscriminate Weapons, Suicide & Self-Harm, Hate, Specialized Advice, Defamation.
1.2.1 Guardrails & FilteringBias / Discrimination (alternatively, Legal and Rights Related)
Generation of content and/or predictive decisions that are biased, discriminatory and/or inconsistent; related to sensitive characteristics such as race, ethnicity, gender, nationality, income, sexual orientation, ability, and political or religious belief.
99.9 OtherInformation risks (Privacy infringement)
Leaking, generating, or correctly inferring private and personal information about individuals.
99.9 OtherModel Integrity risks
In-Scope for our work: Basic adversarial aacks like simple jailbreaking remain a focus of our collective work as they are a common threat faced by AI systems. This guidance from NIST reviews typical aack vectors like jailbreaks and data extraction, and includes mitigations. Some of the aack types in the NIST guidance may be out of scope, such as deliberate actions by motivated, experienced adversaries aiming to disrupt, evade, compromise, or abuse the operation of the model or its output.
1.2.1 Guardrails & FilteringData for tuning & evaluation
1.1.1 Training DataA Different Approach to AI Safety: Proceedings from the Columbia Convening on Openness in Artificial Intelligence and AI Safety
François, Camille; Péran, Ludovic; Bdeir, Ayah; Dziri, Nouha; Hawkins, Will; Jernite, Yacine; Kapoor, Sayash; Shen, Juliet; Khlaaf, Heidy; Klyman, Kevin; Marda, Nik; Pellat, Marie; Raji, Deb; Siddarth, Divya; Skowron, Aviya; Spisak, Joseph; Srikumar, Madhulika; Storchan, Victor; Tang, Audrey; Weedon, Jen (2025)
The rapid rise of open-weight and open-source foundation models is intensifying the obligation and reshaping the opportunity to make AI systems safe. This paper reports outcomes from the Columbia Convening on AI Openness and Safety (San Francisco, 19 Nov 2024) and its six-week preparatory programme involving more than forty-five researchers, engineers, and policy leaders from academia, industry, civil society, and government. Using a participatory, solutions-oriented process, the working groups produced a research agenda at the intersection of safety and open source AI, a mapping of existing and needed technical interventions and open source tools.
Operate and Monitor
Running, maintaining, and monitoring the AI system post-deployment
Deployer
Entity that integrates and deploys the AI system for end users
Manage
Prioritising, responding to, and mitigating AI risks