This page is still being polished. If you have thoughts, please share them via the feedback form.
Data on this page is preliminary and may change. Please do not share or cite these figures publicly.
Shared evaluation datasets, testing frameworks, and measurement tools for AI systems.
Also in Shared Infrastructure
Reasoning
Shared evaluation dataset or benchmark enabling standardized AI system assessment across organizations.
Model evaluations
Benchmark inaccuracy
99 OtherBenchmark limitations
99 OtherGood benchmarking practices
3.2.1 Benchmarks & EvaluationBenchmarking
Benchmarking is an evaluation method where different models are compared against a standardized dataset or a predetermined task. It allows comparison both across different models and over time, providing a reference point for model assessment. Benchmarks are usually open, where their question-answer pairs are publicly available [235].
3.2.1 Benchmarks & EvaluationTest robustness of GPAI system on relevant benchmarks
Various benchmarks [52, 236] have been developed to assess the robustness of GPAI systems when deployed in environments or scenarios that differ from their training conditions. These benchmarks typically evaluate the model’s ability to handle variations in inputs, unexpected data distributions, or adversarial examples, aiming to ensure reliable performance outside the original training domain.
2.2.2 Testing & EvaluationFrequent benchmarking to identify when red teaming is needed
Benchmarks, once created, are inexpensive to apply but may be less informative than red teaming. One reason is that sensitive data (e.g., relating to CBRNrelated capabilities) cannot be included in the public questions and answers of benchmarks. On the other hand, red teaming can be more accurate given participants with diverse attack strategies, but it requires more resources to execute than benchmarking. If there is a correlation between benchmarking and redteaming scores, then employing frequent benchmarking during the development of the model can identify arising vulnerabilities and inform the developers when more thorough red teaming is required [21]. Benchmarks can act as early warning signs of a larger issue, and red teaming can then be employed to investigate the severity and extent of such an issue.
2.2.2 Testing & EvaluationContamination detection
Contamination detection refers to techniques for assessing whether and to what extent a given model has benchmark data in its training dataset [161, 170]. This can involve a set of technical and regulatory interventions to prevent or identify a model trained on contaminated data.
2.2.2 Testing & EvaluationPreventing or mitigating data contamination and leakage
Developers of GPAIS and creators of benchmarks can take actions to prevent AI models from being trained on contaminated or leaked data, or mitigate such data contamination and leakage.
1.1.1 Training DataTracking benchmark leakage
Constant monitoring and documentation of benchmark leakage can help with the early detection of benchmark leaks [224, 95]
2.3.3 Monitoring & LoggingReporting data decontamination efforts
Decontamination analysis may involve comparing the training dataset with the benchmark data, and publishing a report with statistics such as data overlap [235]. Especially for already trained models, including contamination statistics can allow experts to reweigh the success of the model on affected benchmarks.
2.2.4 Assurance DocumentationAvoiding benchmark data with publicly available solutions and releasing contextual information for internet-derived data
Evaluators can improve the integrity of benchmarks by avoiding data with publicly available solutions. When using internet-derived data, supplementing it with contextual information (such as entity linking) may help to better document and assess the integrity of the collected data [95]. This could help in detecting instances where contextual information (which may have been included in the training data) might inadvertently reveal details about the solution, ensuring that the data is suitable for accurate benchmarking.
3.2.1 Benchmarks & EvaluationModel development
2.4 Engineering & DevelopmentModel development > Data-related
1.1 ModelModel evaluations
2.2.2 Testing & EvaluationModel evaluations > General evaluations
2.2.2 Testing & EvaluationModel evaluations > Red teaming
2.2.2 Testing & EvaluationModel evaluations > Auditing
2.2.3 Auditing & ComplianceRisk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems
Gipiškis, Rokas; San Joaquin, Ayrton; Chin, Ze Shen; Regenfuß, Adrian; Gil, Ariel; Holtman, Koen (2024)
Organizations and governments that develop, deploy, use, and govern AI must coordinate on effective risk mitigation. However, the landscape of AI risk mitigation frameworks is fragmented, uses inconsistent terminology, and has gaps in coverage. This paper introduces a preliminary AI Risk Mitigation Taxonomy to organize AI risk mitigations and provide a common frame of reference. The Taxonomy was developed through a rapid evidence scan of 13 AI risk mitigation frameworks published between 2023-2025, which were extracted into a living database of 831 distinct AI risk mitigations.
Other (outside lifecycle)
Outside the standard AI system lifecycle
Developer
Entity that creates, trains, or modifies the AI system
Measure
Quantifying, testing, and monitoring identified AI risks