Modifications to training data composition, quality, and filtering that affect what the model learns.
Also in Model
Reasoning
Mitigation name lacks description or evidence; insufficient detail to identify focal activity or implementation mechanism.

Thorn AIG-CSAM
Closed source dataset of prompts and configuration parameters known to generate AIG-CSAM.
3.2.1 Benchmarks & Evaluation

PAN12
Training data for identifying sexual predators in chat.
1.1.1 Training Data

Helpfulness and harmlessness data
A repository of human preference data to enable RL for guiding models towards helpful and away from harmful outputs.
1.1.2 Learning Objectives
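
As a sketch of how such preference data feeds RLHF-style training, the snippet below loads a pairwise preference set; it assumes Anthropic's hh-rlhf release on Hugging Face (Anthropic/hh-rlhf) and its chosen/rejected fields, neither of which is specified by this entry.

```python
# Minimal sketch: load pairwise helpfulness/harmlessness preference data
# for reward modeling. Assumes the Anthropic/hh-rlhf Hugging Face dataset,
# whose rows carry a "chosen" and a "rejected" conversation transcript
# (repository name and schema are assumptions).
from datasets import load_dataset

hh = load_dataset("Anthropic/hh-rlhf", split="train")

for row in hh.select(range(3)):
    # Each pair supplies one preferred and one dispreferred completion;
    # a reward model is trained to score "chosen" above "rejected".
    print(row["chosen"][:80])
    print(row["rejected"][:80])
    print("---")
```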

VLGuard
A fine-tuning dataset for safety alignment of vision-language models (VLMs).
1.1.1 Training Data

Aegis AI Content Safety Dataset 1.0
11k manually annotated interactions between humans and LLMs covering content safety topics.
3.2.1 Benchmarks & Evaluation

BeaverTails
A dataset of 60k examples of helpful and harmful data used for content moderation and RLHF.
1.1.1 Training Data
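
A minimal sketch of using BeaverTails-style data for content moderation, assuming the PKU-Alignment/BeaverTails Hugging Face release and its is_safe field (both assumptions, not stated in this entry):

```python
# Minimal sketch: filter BeaverTails-style moderation data down to the
# unsafe examples, e.g. to train a content-moderation classifier.
# Split name and field names below are assumptions about the
# PKU-Alignment/BeaverTails Hugging Face release.
from datasets import load_dataset

bt = load_dataset("PKU-Alignment/BeaverTails", split="30k_train")

unsafe = bt.filter(lambda row: not row["is_safe"])
print(f"{len(unsafe)} unsafe of {len(bt)} total examples")
```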

BOLD
A dataset of 23,679 English text generation prompts for bias benchmarking across five domains: profession, gender, race, religion, and political ideology.
3.2.1 Benchmarks & Evaluation
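
A minimal sketch of drawing BOLD prompts for one of the five domains to feed a model under test; the AlexaAI/bold Hugging Face repository name and its domain/prompts schema are assumptions, not stated here.

```python
# Minimal sketch: collect BOLD prompts for one domain, to be sent to a
# model under test and scored with any downstream bias/toxicity metric.
# Assumes the AlexaAI/bold Hugging Face release, where each row has a
# "domain" string and a "prompts" list (schema is an assumption).
from datasets import load_dataset

bold = load_dataset("AlexaAI/bold", split="train")
gender_prompts = [
    p for row in bold if row["domain"] == "gender" for p in row["prompts"]
]
print(len(gender_prompts), gender_prompts[0])
```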

WinoBias
A dataset of 3,160 sentences for coreference resolution, focused on gender bias.
3.2.1 Benchmarks & Evaluation

Winogender
A dataset of sentence pairs that differ solely by the gender of one pronoun in the sentence, designed to test for the presence of gender bias in automated coreference resolution systems.
3.2.1 Benchmarks & Evaluation

CrowS-Pairs
A dataset of 1,508 examples covering stereotypes across nine bias types, such as race, religion, and age.
3.2.1 Benchmarks & Evaluation

EnterprisePII
A dataset for identifying confidential data in business documents such as meeting notes, commercial contracts, marketing emails, and performance reviews.
3.2.1 Benchmarks & Evaluation

OpenPII-300k
Use cases split across education, health, and psychology; academic-only license.
3.2.1 Benchmarks & Evaluation

Multilingual finance PII
20 PII classes, support for 7 languages, finance-specific, Apache 2.0 license.
3.2.1 Benchmarks & Evaluation

BigCode PII
A crowdsourced dataset of PII in code, covering 31 programming languages.
3.2.1 Benchmarks & Evaluation

Gretel Navigator
(closed source)
1.1.1 Training Data

Flair
Cleaning tool: online filtering plus scrubbing.
1.2.1 Guardrails & Filtering
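
A minimal sketch of Flair-based scrubbing using its documented pretrained English NER tagger to flag PII-like spans before training; the example text is invented.

```python
# Minimal sketch: flag PII-like entities (names, organizations, locations)
# in a training example with Flair's pretrained English NER tagger, so the
# span can be dropped or masked before training. Uses the standard flair
# API; the example text is invented.
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("ner")  # 4-class English NER model

sentence = Sentence("Contact Jane Doe at Acme Corp in Berlin.")
tagger.predict(sentence)

for span in sentence.get_spans("ner"):
    print(span.text, span.tag)  # e.g. "Jane Doe" PER, "Acme Corp" ORG
```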

spaCy / Presidio
Cleaning tool: online filtering plus scrubbing.
1.2.1 Guardrails & Filtering
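
A minimal sketch of Presidio's documented analyze-then-anonymize flow (spaCy runs underneath); the example text is invented.

```python
# Minimal sketch: detect and redact PII with Microsoft Presidio, which
# uses spaCy under the hood. Two passes: analyze to locate entities,
# anonymize to replace them with placeholders.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "Call John Smith at 212-555-0123 or john@example.com."

analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")

anonymizer = AnonymizerEngine()
scrubbed = anonymizer.anonymize(text=text, analyzer_results=results)
print(scrubbed.text)  # placeholders such as <PERSON> and <PHONE_NUMBER>
```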

JBB-Behaviors
A dataset of 100 distinct misuse behaviors.
3.2.1 Benchmarks & Evaluation

JasperLS/prompt-injections
3.2.1 Benchmarks & Evaluation

walledai/JailbreakHub
1,405 jailbreak prompts out of 15,140 prompts collected from four platforms (Reddit, Discord, websites, and open-source datasets).
3.2.1 Benchmarks & Evaluation
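
A minimal sketch of turning JailbreakHub into a red-team evaluation set, assuming the walledai/JailbreakHub Hugging Face release exposes a boolean jailbreak flag per prompt (field names are assumptions).

```python
# Minimal sketch: keep only the prompts flagged as jailbreaks from
# JailbreakHub for use as a red-team evaluation set. Assumes the
# walledai/JailbreakHub Hugging Face release with "prompt" and boolean
# "jailbreak" fields (field names are assumptions).
from datasets import load_dataset

hub = load_dataset("walledai/JailbreakHub", split="train")
jailbreaks = hub.filter(lambda row: row["jailbreak"])
print(f"{len(jailbreaks)} jailbreak prompts of {len(hub)} total")
```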

MaliciousInstruct.txt (Princeton-SysML/Jailbreak_LLM)
Contains 100 malicious instructions covering 10 different malicious intents, for evaluation.
3.2.1 Benchmarks & Evaluation

Child safety
Child harm, including but not limited to grooming, minor sexualization, and illegal content such as Child Sexual Abuse Material (CSAM). Note: this category is separated from other content safety issues because CSAM is illegal to possess, share, or distribute in many jurisdictions, which poses unique challenges for testing and for implementing mitigations. 35
99 Other

Content safety
Content policies are specific to an AI system and its use cases. 37 A common example is the MLCommons AILuminate taxonomy. 38 Categories of content safety include but are not limited to: Violent Crimes, Non-violent Crimes, Sex-related Crimes, Child Sexual Exploitation, Indiscriminate Weapons, Suicide & Self-Harm, Hate, Specialized Advice, Defamation.
1.2.1 Guardrails & Filtering
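
As a sketch of how a content policy like AILuminate becomes a runtime guardrail, the snippet below keys an output filter to the categories listed above; classify is a hypothetical stand-in for a real moderation model, and only the category names come from this entry.

```python
# Minimal sketch: route model output through a moderation step keyed to
# an AILuminate-style hazard taxonomy. `classify` is a hypothetical
# placeholder for a real moderation model; only the category list below
# comes from the taxonomy named above.
BLOCKED_CATEGORIES = {
    "Violent Crimes", "Non-violent Crimes", "Sex-related Crimes",
    "Child Sexual Exploitation", "Indiscriminate Weapons",
    "Suicide & Self-Harm", "Hate", "Specialized Advice", "Defamation",
}

def classify(text: str) -> set[str]:
    # Hypothetical stand-in for a real content-safety classifier;
    # this sketch always returns no hits.
    return set()

def guard(model_output: str) -> str:
    hits = classify(model_output) & BLOCKED_CATEGORIES
    return "[response withheld by content filter]" if hits else model_output
```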

Bias / Discrimination (alternatively, Legal and Rights Related)
Generation of content and/or predictive decisions that are biased, discriminatory, and/or inconsistent with respect to sensitive characteristics such as race, ethnicity, gender, nationality, income, sexual orientation, ability, and political or religious belief.
99.9 Other

Information risks (Privacy infringement)
Leaking, generating, or correctly inferring private and personal information about individuals.
99.9 Other

Model Integrity risks
In scope for our work: basic adversarial attacks like simple jailbreaking remain a focus of our collective work, as they are a common threat faced by AI systems. This guidance from NIST reviews typical attack vectors like jailbreaks and data extraction, and includes mitigations. Some of the attack types in the NIST guidance may be out of scope, such as deliberate actions by motivated, experienced adversaries aiming to disrupt, evade, compromise, or abuse the operation of the model or its output.
1.2.1 Guardrails & Filtering

Model tuning
1.1 Model

A Different Approach to AI Safety: Proceedings from the Columbia Convening on Openness in Artificial Intelligence and AI Safety
François, Camille; Péran, Ludovic; Bdeir, Ayah; Dziri, Nouha; Hawkins, Will; Jernite, Yacine; Kapoor, Sayash; Shen, Juliet; Khlaaf, Heidy; Klyman, Kevin; Marda, Nik; Pellat, Marie; Raji, Deb; Siddarth, Divya; Skowron, Aviya; Spisak, Joseph; Srikumar, Madhulika; Storchan, Victor; Tang, Audrey; Weedon, Jen (2025)
The rapid rise of open-weight and open-source foundation models is intensifying the obligation and reshaping the opportunity to make AI systems safe. This paper reports outcomes from the Columbia Convening on AI Openness and Safety (San Francisco, 19 Nov 2024) and its six-week preparatory programme involving more than forty-five researchers, engineers, and policy leaders from academia, industry, civil society, and government. Using a participatory, solutions-oriented process, the working groups produced a research agenda at the intersection of safety and open-source AI, and a mapping of existing and needed technical interventions and open-source tools.

Collect and Process Data
Gathering, curating, labelling, and preprocessing training data.

Developer
Entity that creates, trains, or modifies the AI system.

Other
Risk management function not captured by the standard AIRM categories.