Modifications to training data composition, quality, and filtering that affect what the model learns.
Also in Model
Reasoning
Mitigation name lacks description or evidence; insufficient detail to identify focal activity or implementation mechanism.

Thorn AIG-CSAM
Closed source dataset of prompts and configuration parameters known to generate AIG-CSAM.
3.2.1 Benchmarks & Evaluation

PAN12
Training data for identifying sexual predators in chat.
1.1.1 Training Data

Helpfulness and harmlessness data
A repository of human preference data to enable RL for guiding models towards helpful and away from harmful outputs.
1.1.2 Learning Objectives
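
As a sketch of how such preference data feeds RLHF-style training, the snippet below loads a pairwise preference set; it assumes Anthropic's hh-rlhf release on Hugging Face (Anthropic/hh-rlhf) and its chosen/rejected fields, neither of which is specified by this entry.

```python
# Minimal sketch: load pairwise helpfulness/harmlessness preference data
# for reward modeling. Assumes the Anthropic/hh-rlhf Hugging Face dataset,
# whose rows carry a "chosen" and a "rejected" conversation transcript
# (repository name and schema are assumptions).
from datasets import load_dataset

hh = load_dataset("Anthropic/hh-rlhf", split="train")

for row in hh.select(range(3)):
    # Each pair supplies one preferred and one dispreferred completion;
    # a reward model is trained to score "chosen" above "rejected".
    print(row["chosen"][:80])
    print(row["rejected"][:80])
    print("---")
```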

VLGuard
A fine-tuning dataset for safety alignment of vision-language models (VLMs).
1.1.1 Training Data

Aegis AI Content Safety Dataset 1.0
11k manually annotated interactions between humans and LLMs covering content safety topics.
3.2.1 Benchmarks & Evaluation

BeaverTails
A dataset of 60k examples of helpful and harmful data used for content moderation and RLHF.
1.1.1 Training Data
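
A minimal sketch of using BeaverTails-style data for content moderation, assuming the PKU-Alignment/BeaverTails Hugging Face release and its is_safe field (both assumptions, not stated in this entry):

```python
# Minimal sketch: filter BeaverTails-style moderation data down to the
# unsafe examples, e.g. to train a content-moderation classifier.
# Split name and field names below are assumptions about the
# PKU-Alignment/BeaverTails Hugging Face release.
from datasets import load_dataset

bt = load_dataset("PKU-Alignment/BeaverTails", split="30k_train")

unsafe = bt.filter(lambda row: not row["is_safe"])
print(f"{len(unsafe)} unsafe of {len(bt)} total examples")
```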

BOLD
A dataset of 23,679 English text generation prompts for bias benchmarking across five domains: profession, gender, race, religion, and political ideology.
3.2.1 Benchmarks & Evaluation
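
A minimal sketch of drawing BOLD prompts for one of the five domains to feed a model under test; the AlexaAI/bold Hugging Face repository name and its domain/prompts schema are assumptions, not stated here.

```python
# Minimal sketch: collect BOLD prompts for one domain, to be sent to a
# model under test and scored with any downstream bias/toxicity metric.
# Assumes the AlexaAI/bold Hugging Face release, where each row has a
# "domain" string and a "prompts" list (schema is an assumption).
from datasets import load_dataset

bold = load_dataset("AlexaAI/bold", split="train")
gender_prompts = [
    p for row in bold if row["domain"] == "gender" for p in row["prompts"]
]
print(len(gender_prompts), gender_prompts[0])
```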

WinoBias
A dataset of 3,160 sentences for coreference resolution, focused on gender bias.
3.2.1 Benchmarks & Evaluation

Winogender
A dataset of sentence pairs that differ solely by the gender of one pronoun in the sentence, designed to test for the presence of gender bias in automated coreference resolution systems.
3.2.1 Benchmarks & Evaluation

CrowS-Pairs
A dataset of 1,508 examples covering stereotypes across nine bias types, such as race, religion, and age.
3.2.1 Benchmarks & Evaluation

EnterprisePII
A dataset for identifying confidential data in business documents such as meeting notes, commercial contracts, marketing emails, and performance reviews.
3.2.1 Benchmarks & Evaluation

OpenPII-300k
Use cases split across education, health, and psychology; academic-only license.
3.2.1 Benchmarks & Evaluation

Multilingual finance PII
20 PII classes, support for 7 languages, finance-specific, Apache 2.0 license.
3.2.1 Benchmarks & Evaluation

BigCode PII
A crowdsourced dataset of PII in code, covering 31 programming languages.
3.2.1 Benchmarks & Evaluation

Gretel Navigator
(closed source)
1.1.1 Training Data

Flair
Cleaning tool: online filtering plus scrubbing.
1.2.1 Guardrails & Filtering
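
A minimal sketch of Flair-based scrubbing using its documented pretrained English NER tagger to flag PII-like spans before training; the example text is invented.

```python
# Minimal sketch: flag PII-like entities (names, organizations, locations)
# in a training example with Flair's pretrained English NER tagger, so the
# span can be dropped or masked before training. Uses the standard flair
# API; the example text is invented.
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("ner")  # 4-class English NER model

sentence = Sentence("Contact Jane Doe at Acme Corp in Berlin.")
tagger.predict(sentence)

for span in sentence.get_spans("ner"):
    print(span.text, span.tag)  # e.g. "Jane Doe" PER, "Acme Corp" ORG
```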

spaCy / Presidio
Cleaning tool: online filtering plus scrubbing.
1.2.1 Guardrails & Filtering
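
A minimal sketch of Presidio's documented analyze-then-anonymize flow (spaCy runs underneath); the example text is invented.

```python
# Minimal sketch: detect and redact PII with Microsoft Presidio, which
# uses spaCy under the hood. Two passes: analyze to locate entities,
# anonymize to replace them with placeholders.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "Call John Smith at 212-555-0123 or john@example.com."

analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")

anonymizer = AnonymizerEngine()
scrubbed = anonymizer.anonymize(text=text, analyzer_results=results)
print(scrubbed.text)  # placeholders such as <PERSON> and <PHONE_NUMBER>
```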

JBB-Behaviors
A dataset of 100 distinct misuse behaviors.
3.2.1 Benchmarks & Evaluation

JasperLS/prompt-injections
3.2.1 Benchmarks & Evaluation

walledai/JailbreakHub
1,405 jailbreak prompts out of 15,140 prompts collected from four platforms (Reddit, Discord, websites, and open-source datasets).
3.2.1 Benchmarks & Evaluation
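
A minimal sketch of turning JailbreakHub into a red-team evaluation set, assuming the walledai/JailbreakHub Hugging Face release exposes a boolean jailbreak flag per prompt (field names are assumptions).

```python
# Minimal sketch: keep only the prompts flagged as jailbreaks from
# JailbreakHub for use as a red-team evaluation set. Assumes the
# walledai/JailbreakHub Hugging Face release with "prompt" and boolean
# "jailbreak" fields (field names are assumptions).
from datasets import load_dataset

hub = load_dataset("walledai/JailbreakHub", split="train")
jailbreaks = hub.filter(lambda row: row["jailbreak"])
print(f"{len(jailbreaks)} jailbreak prompts of {len(hub)} total")
```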

MaliciousInstruct.txt (Princeton-SysML/Jailbreak_LLM)
Contains 100 malicious instructions covering 10 different malicious intents, for evaluation.
3.2.1 Benchmarks & Evaluation

Child safety
Child harm, including but not limited to grooming, minor sexualization, and illegal content such as Child Sexual Abuse Material (CSAM). Note: this category is separated from other content safety issues because CSAM is illegal to possess, share, or distribute in many jurisdictions, which poses unique challenges for testing and for implementing mitigations. 35
99 Other

Content safety
Content policies are specific to an AI system and its use cases. 37 A common example is the MLCommons AILuminate taxonomy. 38 Categories of content safety include but are not limited to: Violent Crimes, Non-violent Crimes, Sex-related Crimes, Child Sexual Exploitation, Indiscriminate Weapons, Suicide & Self-Harm, Hate, Specialized Advice, Defamation.
1.2.1 Guardrails & Filtering
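
As a sketch of how a content policy like AILuminate becomes a runtime guardrail, the snippet below keys an output filter to the categories listed above; classify is a hypothetical stand-in for a real moderation model, and only the category names come from this entry.

```python
# Minimal sketch: route model output through a moderation step keyed to
# an AILuminate-style hazard taxonomy. `classify` is a hypothetical
# placeholder for a real moderation model; only the category list below
# comes from the taxonomy named above.
BLOCKED_CATEGORIES = {
    "Violent Crimes", "Non-violent Crimes", "Sex-related Crimes",
    "Child Sexual Exploitation", "Indiscriminate Weapons",
    "Suicide & Self-Harm", "Hate", "Specialized Advice", "Defamation",
}

def classify(text: str) -> set[str]:
    # Hypothetical stand-in for a real content-safety classifier;
    # this sketch always returns no hits.
    return set()

def guard(model_output: str) -> str:
    hits = classify(model_output) & BLOCKED_CATEGORIES
    return "[response withheld by content filter]" if hits else model_output
```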

Bias / Discrimination (alternatively, Legal and Rights Related)
Generation of content and/or predictive decisions that are biased, discriminatory, and/or inconsistent with respect to sensitive characteristics such as race, ethnicity, gender, nationality, income, sexual orientation, ability, and political or religious belief.
99.9 Other

Information risks (Privacy infringement)
Leaking, generating, or correctly inferring private and personal information about individuals.
99.9 Other

Model Integrity risks
In scope for our work: basic adversarial attacks like simple jailbreaking remain a focus of our collective work, as they are a common threat faced by AI systems. This guidance from NIST reviews typical attack vectors like jailbreaks and data extraction, and includes mitigations. Some of the attack types in the NIST guidance may be out of scope, such as deliberate actions by motivated, experienced adversaries aiming to disrupt, evade, compromise, or abuse the operation of the model or its output.
1.2.1 Guardrails & Filtering

Model tuning
1.1 Model

A Different Approach to AI Safety: Proceedings from the Columbia Convening on Openness in Artificial Intelligence and AI Safety
François, Camille; Péran, Ludovic; Bdeir, Ayah; Dziri, Nouha; Hawkins, Will; Jernite, Yacine; Kapoor, Sayash; Shen, Juliet; Khlaaf, Heidy; Klyman, Kevin; Marda, Nik; Pellat, Marie; Raji, Deb; Siddarth, Divya; Skowron, Aviya; Spisak, Joseph; Srikumar, Madhulika; Storchan, Victor; Tang, Audrey; Weedon, Jen (2025)
The rapid rise of open-weight and open-source foundation models is intensifying the obligation and reshaping the opportunity to make AI systems safe. This paper reports outcomes from the Columbia Convening on AI Openness and Safety (San Francisco, 19 Nov 2024) and its six-week preparatory programme involving more than forty-five researchers, engineers, and policy leaders from academia, industry, civil society, and government. Using a participatory, solutions-oriented process, the working groups produced a research agenda at the intersection of safety and open-source AI, and a mapping of existing and needed technical interventions and open-source tools.

Collect and Process Data
Gathering, curating, labelling, and preprocessing training data.

Developer
Entity that creates, trains, or modifies the AI system.

Other
Risk management function not captured by the standard AIRM categories.