BackBenchmarking

This page is still being polished. If you have thoughts, please share them via the feedback form.

Data on this page is preliminary and may change. Please do not share or cite these figures publicly.

Benchmarking

Gipiskis (2024)|LLM classified

Mitigation Taxonomy

3Ecosystem

3.2Shared Infrastructure

3.2.1Benchmarks & Evaluation

Shared evaluation datasets, testing frameworks, and measurement tools for AI systems.

Also in Shared Infrastructure

3.2.2 Technical Standards3.2.3 Research Resources

LLM Classification Details

Reasoning

Shared evaluation dataset or benchmark enabling standardized AI system assessment across organizations.

Code: 3.2.1Version: v0.6Classified: Feb 6, 2026

Part of

Model evaluations

Sub-mitigations (11)

Benchmark inaccuracy

99 Other

Lifecycle:Other (outside lifecycle)Actor:Unable to classifyAIRM:Unable to classify

Benchmark limitations

99 Other

Lifecycle:Other (outside lifecycle)Actor:DeveloperAIRM:Unable to classify

Good benchmarking practices

3.2.1 Benchmarks & Evaluation

Lifecycle:Other (outside lifecycle)Actor:DeveloperAIRM:Measure

Benchmarking

Benchmarking is an evaluation method where different models are compared against a standardized dataset or a predetermined task. It allows comparison both across different models and over time, providing a reference point for model assessment. Benchmarks are usually open, where their question-answer pairs are publicly available [235].

3.2.1 Benchmarks & Evaluation

Lifecycle:Other (outside lifecycle)Actor:DeveloperAIRM:Measure

Test robustness of GPAI system on relevant benchmarks

Various benchmarks [52, 236] have been developed to assess the robustness of GPAI systems when deployed in environments or scenarios that differ from their training conditions. These benchmarks typically evaluate the model’s ability to handle variations in inputs, unexpected data distributions, or adversarial examples, aiming to ensure reliable performance outside the original training domain.

2.2.2 Testing & Evaluation

Lifecycle:Verify and ValidateActor:DeveloperAIRM:Measure

Frequent benchmarking to identify when red teaming is needed

Benchmarks, once created, are inexpensive to apply but may be less informative than red teaming. One reason is that sensitive data (e.g., relating to CBRNrelated capabilities) cannot be included in the public questions and answers of benchmarks. On the other hand, red teaming can be more accurate given participants with diverse attack strategies, but it requires more resources to execute than benchmarking. If there is a correlation between benchmarking and redteaming scores, then employing frequent benchmarking during the development of the model can identify arising vulnerabilities and inform the developers when more thorough red teaming is required [21]. Benchmarks can act as early warning signs of a larger issue, and red teaming can then be employed to investigate the severity and extent of such an issue.

2.2.2 Testing & Evaluation

Lifecycle:Build and Use ModelActor:DeveloperAIRM:Measure

Contamination detection

Contamination detection refers to techniques for assessing whether and to what extent a given model has benchmark data in its training dataset [161, 170]. This can involve a set of technical and regulatory interventions to prevent or identify a model trained on contaminated data.

2.2.2 Testing & Evaluation

Lifecycle:Collect and Process DataActor:DeveloperAIRM:Measure

Preventing or mitigating data contamination and leakage

Developers of GPAIS and creators of benchmarks can take actions to prevent AI models from being trained on contaminated or leaked data, or mitigate such data contamination and leakage.

1.1.1 Training Data

Lifecycle:Collect and Process DataActor:DeveloperAIRM:Manage

Tracking benchmark leakage

Constant monitoring and documentation of benchmark leakage can help with the early detection of benchmark leaks [224, 95]

2.3.3 Monitoring & Logging

Lifecycle:Operate and MonitorActor:DeveloperAIRM:Measure

Reporting data decontamination efforts

Decontamination analysis may involve comparing the training dataset with the benchmark data, and publishing a report with statistics such as data overlap [235]. Especially for already trained models, including contamination statistics can allow experts to reweigh the success of the model on affected benchmarks.

2.2.4 Assurance Documentation

Lifecycle:Collect and Process DataActor:DeveloperAIRM:Measure

Avoiding benchmark data with publicly available solutions and releasing contextual information for internet-derived data

Evaluators can improve the integrity of benchmarks by avoiding data with publicly available solutions. When using internet-derived data, supplementing it with contextual information (such as entity linking) may help to better document and assess the integrity of the collected data [95]. This could help in detecting instances where contextual information (which may have been included in the training data) might inadvertently reveal details about the solution, ensuring that the data is suitable for accurate benchmarking.

3.2.1 Benchmarks & Evaluation

Lifecycle:Verify and ValidateActor:DeveloperAIRM:Measure

Other mitigations from Gipiskis (2024) (112)

Model development

2.4 Engineering & Development

Lifecycle:Build and Use ModelActor:DeveloperAIRM:Unable to classify

Model development > Data-related

1.1 Model

Lifecycle:Collect and Process DataActor:Unable to classifyAIRM:Unable to classify

Model evaluations

2.2.2 Testing & Evaluation

Lifecycle:Verify and ValidateActor:DeveloperAIRM:Measure

Model evaluations > General evaluations

2.2.2 Testing & Evaluation

Lifecycle:Verify and ValidateActor:DeveloperAIRM:Measure

Model evaluations > Red teaming

2.2.2 Testing & Evaluation

Lifecycle:Verify and ValidateActor:DeveloperAIRM:Measure

Model evaluations > Auditing

2.2.3 Auditing & Compliance

Lifecycle:Verify and ValidateActor:Governance ActorAIRM:Measure

View all 112 mitigations from this source →

Source Document

Risk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems

Gipiškis, Rokas; San Joaquin, Ayrton; Chin, Ze Shen; Regenfuß, Adrian; Gil, Ariel; Holtman, Koen (2024)

Organizations and governments that develop, deploy, use, and govern AI must coordinate on effective risk mitigation. However, the landscape of AI risk mitigation frameworks is fragmented, uses inconsistent terminology, and has gaps in coverage. This paper introduces a preliminary AI Risk Mitigation Taxonomy to organize AI risk mitigations and provide a common frame of reference. The Taxonomy was developed through a rapid evidence scan of 13 AI risk mitigation frameworks published between 2023-2025, which were extracted into a living database of 831 distinct AI risk mitigations.

View source DOI: 10.48550/arXiv.2410.23472

Classification

AI Lifecycle Stage

Other (outside lifecycle)

Outside the standard AI system lifecycle

Responsible Actor

Developer

Entity that creates, trains, or modifies the AI system

NIST AI RMF Function

Measure

Quantifying, testing, and monitoring identified AI risks

Risk Domains

Primary

7.3 Lack of capability or robustness