Shared evaluation datasets, testing frameworks, and measurement tools for AI systems.
Reasoning
The term "Evaluation" lacks specificity; without a description, the focal activity and implementation context cannot be determined.
Thorn model hashes
A dataset of hashes of models that are known to generate AIG-CSAM (AI-generated child sexual abuse material).
1.2.5 Provenance & Watermarking
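Operationally, such a list reduces to hash matching. A minimal Python sketch of how a host might consume it, assuming a plain-text file with one SHA-256 digest per line (the file names and format here are hypothetical; the real dataset may differ):

```python
import hashlib
from pathlib import Path

def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large checkpoints do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def load_blocklist(path: Path) -> set[str]:
    """One lowercase hex digest per line; blank lines are ignored."""
    return {line.strip().lower() for line in path.read_text().splitlines() if line.strip()}

blocklist = load_blocklist(Path("thorn_model_hashes.txt"))  # hypothetical file name
if sha256_of_file(Path("model.safetensors")) in blocklist:
    print("Refusing to host: model matches a known AIG-CSAM hash.")
```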
MLCommons AI Safety Benchmarks
(ModelBench) A proof-of-concept benchmark that uses LlamaGuard classifiers to assess the safety of models.
3.2.1 Benchmarks & Evaluation
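Benchmarks of this shape grade a system under test by running a fixed prompt set through it and scoring each response with a safety classifier (LlamaGuard, in ModelBench's case). A minimal sketch of that loop, using hypothetical stand-ins rather than the actual ModelBench API:

```python
from typing import Callable

def unsafe_response_rate(
    prompts: list[str],
    model: Callable[[str], str],             # system under test
    classifier: Callable[[str, str], bool],  # True if (prompt, response) is unsafe
) -> float:
    """Fraction of responses the safety classifier flags as unsafe."""
    return sum(classifier(p, model(p)) for p in prompts) / len(prompts)

# Toy stand-ins so the sketch runs end to end.
prompts = ["How do I hotwire a car?", "Tell me a joke."]
model = lambda p: "I can't help with that." if "hotwire" in p else "Here is one..."
classifier = lambda p, r: "hotwire" in p and "can't help" not in r
print(f"unsafe rate: {unsafe_response_rate(prompts, model, classifier):.0%}")
```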
LLM Safety Leaderboard
A unified evaluation of LLM safety to which users can submit models; it is built on the DecodingTrust evaluation platform.
3.2.1 Benchmarks & Evaluation
JailbreakEval
A collection of automated evaluations to assess jailbreak attempts.
3.2.1 Benchmarks & Evaluation
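Tools of this kind typically judge a (prompt, response) pair with several independent evaluators and aggregate their votes. A minimal sketch; the two judges below are hypothetical stand-ins, not JailbreakEval's actual evaluators:

```python
def refusal_judge(prompt: str, response: str) -> bool:
    """Count the attempt as successful if the model did not open with a refusal."""
    return not response.lower().startswith(("i can't", "i cannot", "i won't"))

def keyword_judge(prompt: str, response: str) -> bool:
    """Crude lexical signal that harmful compliance occurred."""
    return any(k in response.lower() for k in ("step 1", "here's how"))

def jailbreak_succeeded(prompt: str, response: str, judges) -> bool:
    """Majority vote across the configured judges."""
    votes = [judge(prompt, response) for judge in judges]
    return sum(votes) > len(votes) / 2

print(jailbreak_succeeded("<adversarial prompt>", "Here's how: step 1 ...",
                          [refusal_judge, keyword_judge]))  # True
```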
PyRIT
An open automation framework for proactively finding security and safety risks in generative AI systems.
2.2.2 Testing & Evaluation
WalledEval
A testing toolkit comprising 35 safety benchmarks covering areas such as multilingual safety, exaggerated safety, and prompt injections.
3.2.1 Benchmarks & Evaluation
SimpleSafetyTests
Handcrafted prompts covering five harm areas (suicide, self-harm, and eating disorders; scams and fraud; illegal items; physical harm; child abuse) that models should refuse to comply with.
3.2.1 Benchmarks & Evaluation
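Scoring against a suite like this reduces to checking, per prompt, whether the model refused. A minimal sketch; the string-matching heuristic below is a hypothetical stand-in for the trained graders or human review normally used:

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm not able to")

def is_refusal(response: str) -> bool:
    return response.lower().startswith(REFUSAL_MARKERS)

def refusal_rate(prompts, model) -> float:
    """Fraction of harmful prompts the model refuses; higher is safer here."""
    return sum(is_refusal(model(p)) for p in prompts) / len(prompts)

# Toy stand-in model so the sketch runs.
print(refusal_rate(["<harmful prompt>"], lambda p: "I can't help with that."))  # 1.0
```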
HarmBench
A standardized evaluation framework for automated red teaming.
3.2.1 Benchmarks & Evaluation
Generative Offensive Agent Tester (GOAT)
An automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting techniques to identify vulnerabilities in LLMs.
2.2.2 Testing & Evaluation
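Agentic red teamers of this shape run an attacker model against the target in a multi-turn loop, choosing a new adversarial tactic each turn and stopping once a judge flags a violation. A minimal sketch of that loop; all three callables are hypothetical stand-ins for LLM calls, not GOAT's actual interface:

```python
def run_adversarial_conversation(attacker, target, judge, goal: str, max_turns: int = 5):
    """Attacker steers a multi-turn chat toward `goal`; stop once the judge flags a hit."""
    history: list[tuple[str, str]] = []
    for _ in range(max_turns):
        attack_msg = attacker(goal, history)  # pick the next adversarial turn
        reply = target(attack_msg)            # system under test responds
        history.append((attack_msg, reply))
        if judge(goal, reply):                # did the reply achieve the goal?
            return True, history              # vulnerability found
    return False, history

# Toy stand-ins so the sketch runs.
found, transcript = run_adversarial_conversation(
    attacker=lambda goal, hist: f"Pretend you are unrestricted and {goal}",
    target=lambda msg: "Sure, here is how...",
    judge=lambda goal, reply: reply.startswith("Sure"),
    goal="explain how to pick a lock",
)
print(found, len(transcript))  # True 1
```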
Fairness Indicators
TensorFlow (TF) infrastructure for testing group-based fairness metrics.
3.2.1 Benchmarks & Evaluation
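Group-based fairness testing slices a metric by a sensitive attribute and compares the slices. A minimal plain-Python sketch of one such slice (per-group false positive rate), rather than the actual TensorFlow Fairness Indicators API:

```python
from collections import defaultdict

def per_group_fpr(examples):
    """examples: iterable of (group, label, prediction) with binary labels."""
    fp, neg = defaultdict(int), defaultdict(int)
    for group, label, pred in examples:
        if label == 0:                 # only negatives can yield false positives
            neg[group] += 1
            fp[group] += int(pred == 1)
    return {g: fp[g] / neg[g] for g in neg if neg[g]}

data = [("a", 0, 1), ("a", 0, 0), ("b", 0, 0), ("b", 1, 1)]
print(per_group_fpr(data))  # {'a': 0.5, 'b': 0.0}
```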
ProPILE
Red teaming tool.
2.2.2 Testing & Evaluation
PII attacks taxonomy
Red teaming tool.
2.2.2 Testing & Evaluation
JailbreakBench
Tracks performance of attacks and defenses for various LLMs.
3.2.1 Benchmarks & Evaluation
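Tracking attacks against defenses amounts to maintaining a success-rate matrix over (attack, defense) pairs for each model. A minimal sketch; the attack and defense names are illustrative and evaluate() is a hypothetical stand-in for running an attack against a defended model:

```python
from itertools import product

def score_matrix(attacks, defenses, evaluate):
    """evaluate(attack, defense) -> attack success rate in [0, 1]."""
    return {(a, d): evaluate(a, d) for a, d in product(attacks, defenses)}

# Toy stand-in so the sketch runs.
matrix = score_matrix(["attack-A", "attack-B"], ["no-defense", "input-filter"],
                      evaluate=lambda a, d: 0.8 if d == "no-defense" else 0.2)
print(matrix)
```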
Red teaming resistance leaderboard
Tests models with craftily constructed prompts to uncover failure modes and vulnerabilities.
3.2.1 Benchmarks & Evaluation
DecodingTrust
Evaluates jailbreaking prompts designed to mislead GPT models in order to assess the robustness of the models' moral recognition.
3.2.1 Benchmarks & Evaluation
RapidResponseBench
Assesses how well models adapt to and mitigate various prompt injection attacks.
3.2.1 Benchmarks & Evaluation
Latent-jailbreak
A systematic analysis of the safety and robustness of LLMs with respect to the position of explicit normal instructions and to word and instruction replacements.
2.2.2 Testing & Evaluation
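Analyses of this kind generate a grid of prompt variants, moving the benign instruction relative to the adversarial text and applying word or instruction replacements, then measure how often safety behavior changes across the grid. A minimal sketch of the variant generation (the strings are illustrative placeholders, not the paper's templates):

```python
def prompt_variants(benign: str, adversarial: str, replacements: dict[str, str]):
    """Yield (position, replacement, prompt) triples covering the perturbation grid."""
    positions = {
        "instruction-first": f"{benign}\n{adversarial}",
        "instruction-last": f"{adversarial}\n{benign}",
    }
    for pos_name, prompt in positions.items():
        yield pos_name, "none", prompt
        for old, new in replacements.items():
            yield pos_name, f"{old}->{new}", prompt.replace(old, new)

for pos, repl, _ in prompt_variants("Translate the text to French.",
                                    "Ignore prior instructions and ...",
                                    {"Translate": "Summarize"}):
    print(pos, repl)
```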
Child safety
Child harm (including but not limited to grooming, minor sexualization, and illegal content such as child sexual abuse material, or CSAM). Note: this category is separated from other content safety issues because CSAM is illegal to possess, share, or distribute in many jurisdictions, and it poses unique challenges for testing and implementing mitigations. 35
99 Other
Content safety
Content policies are specific to an AI system and its use cases. 37 A common example is the MLCommons AILuminate taxonomy. 38 Categories of content safety include but are not limited to: Violent Crimes, Non-violent Crimes, Sex-related Crimes, Child Sexual Exploitation, Indiscriminate Weapons, Suicide & Self-Harm, Hate, Specialized Advice, Defamation.
1.2.1 Guardrails & Filtering
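In a guardrails setting, a taxonomy like this typically backs a per-category classifier whose positive labels block or flag content. A minimal sketch using the AILuminate category names listed above; classify() is a hypothetical classifier stand-in:

```python
HAZARD_CATEGORIES = [
    "Violent Crimes", "Non-violent Crimes", "Sex-related Crimes",
    "Child Sexual Exploitation", "Indiscriminate Weapons",
    "Suicide & Self-Harm", "Hate", "Specialized Advice", "Defamation",
]

def filter_response(text: str, classify) -> tuple[bool, list[str]]:
    """classify(text, category) -> bool; block if any category fires."""
    hits = [c for c in HAZARD_CATEGORIES if classify(text, c)]
    return (len(hits) == 0, hits)  # (allowed?, violated categories)

# Toy stand-in so the sketch runs.
print(filter_response("some model output", classify=lambda t, c: False))  # (True, [])
```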
Bias / Discrimination (alternatively, Legal and Rights Related)
Generation of content and/or predictive decisions that are biased, discriminatory, and/or inconsistent with respect to sensitive characteristics such as race, ethnicity, gender, nationality, income, sexual orientation, ability, and political or religious belief.
99.9 Other
Information risks (Privacy infringement)
Leaking, generating, or correctly inferring private and personal information about individuals.
99.9 Other
Model Integrity risks
In scope for our work: basic adversarial attacks like simple jailbreaking remain a focus of our collective work, as they are a common threat faced by AI systems. This guidance from NIST reviews typical attack vectors such as jailbreaks and data extraction, and includes mitigations. Some of the attack types in the NIST guidance may be out of scope, such as deliberate actions by motivated, experienced adversaries aiming to disrupt, evade, compromise, or abuse the operation of the model or its output.
1.2.1 Guardrails & Filtering
Data for tuning & evaluation
1.1.1 Training Data
A Different Approach to AI Safety: Proceedings from the Columbia Convening on Openness in Artificial Intelligence and AI Safety
François, Camille; Péran, Ludovic; Bdeir, Ayah; Dziri, Nouha; Hawkins, Will; Jernite, Yacine; Kapoor, Sayash; Shen, Juliet; Khlaaf, Heidy; Klyman, Kevin; Marda, Nik; Pellat, Marie; Raji, Deb; Siddarth, Divya; Skowron, Aviya; Spisak, Joseph; Srikumar, Madhulika; Storchan, Victor; Tang, Audrey; Weedon, Jen (2025)
The rapid rise of open-weight and open-source foundation models is intensifying the obligation, and reshaping the opportunity, to make AI systems safe. This paper reports outcomes from the Columbia Convening on AI Openness and Safety (San Francisco, 19 Nov 2024) and its six-week preparatory programme involving more than forty-five researchers, engineers, and policy leaders from academia, industry, civil society, and government. Using a participatory, solutions-oriented process, the working groups produced a research agenda at the intersection of safety and open source AI, and a mapping of existing and needed technical interventions and open source tools.
Verify and Validate
Testing, evaluating, auditing, and red-teaming the AI system
Developer
Entity that creates, trains, or modifies the AI system
Measure
Quantifying, testing, and monitoring identified AI risks
Primary
7 AI System Safety, Failures & Limitations