Changes to the model's learned parameters, architecture, or training process, including modifications to training data that affect what the model learns.
Fine-tuning modifies learned capabilities post-training through targeted parameter adjustment.
Cost-inducing training of AI models specifically for malicious use
AI models can be designed to make further post-training modifications (e.g., fine-tuning) too costly for malicious uses while preserving normal adaptability for non-malicious uses [88, 56].
1.1.3 Capability Modification

Restrict web access during AI training
Developers can restrict or disable AI systems’ internet access during training. For example, developers can restrict web access to read-only (e.g., by disabling write access via HTTP POST requests and web forms) or limit the AI system’s access to a local network [20].
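As a minimal sketch of the read-only restriction described above (the function and constant names are illustrative, not taken from any particular framework, and the local-network check is a deliberately naive string-prefix test):

```python
from urllib.parse import urlparse

READ_ONLY_METHODS = {"GET", "HEAD"}           # no POST/PUT/DELETE: write access disabled
LOCAL_NETWORKS = ("10.", "192.168.", "127.")  # crude allowlist for local-only access

def is_request_allowed(method: str, url: str, local_only: bool = False) -> bool:
    """Return True only for read-only requests, optionally restricted to a local network."""
    if method.upper() not in READ_ONLY_METHODS:
        return False  # blocks web forms and other write paths
    if local_only:
        host = urlparse(url).hostname or ""
        return host.startswith(LOCAL_NETWORKS)
    return True
```

In practice such a policy would sit in an egress proxy rather than in the training code itself, so the model cannot bypass it.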
1.2.2 Runtime Environment

Adversarial training
Adversarial training [83] is a technique for training AI models in which adversarial inputs are generated for a model, and the model is then trained to give the correct outputs for those adversarial inputs. Adversarial training can involve adversarial examples generated by human experts, human users, or other AI systems.
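The loop described above can be sketched with a toy logistic-regression classifier and the Fast Gradient Sign Method (FGSM) as the adversarial-input generator; everything here (data, hyperparameters, function names) is illustrative:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fgsm_perturb(x, y, w, b, eps):
    """FGSM: shift x by eps in the sign of the input gradient of the
    logistic loss, producing an adversarial version of x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad_x = [(p - y) * wi for wi in w]  # d(loss)/dx for logistic loss
    return [xi + eps * (1.0 if g > 0 else -1.0 if g < 0 else 0.0)
            for xi, g in zip(x, grad_x)]

def adversarial_train(data, eps=0.1, lr=0.5, epochs=200):
    """Train a 2-D logistic-regression model on FGSM-perturbed inputs."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            x_adv = fgsm_perturb(x, y, w, b, eps)  # generate adversarial input
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)) + b)
            # gradient step on the adversarial example, not the clean one
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x_adv)]
            b -= lr * (p - y)
    return w, b
```

Training on the perturbed inputs rather than the clean ones is what distinguishes adversarial training from standard training; in real systems the same structure holds, with FGSM replaced by stronger attacks or human-generated adversarial examples.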
1.1.2 Learning Objectives

Robustness certificates
A model can be certified to withstand adversarial attacks given specific datapoint constraints, model constraints, and attack vectors [156, 124]. Certification means that it can be both analytically proven and empirically shown that the model will withstand such attacks up to a certain threshold. Currently, robustness certification methods are limited to certifying against pixel-manipulation attacks bounded in a specific ℓp norm, canonically the ℓ2 (Euclidean) norm, up to a certain neighborhood radius.
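For intuition about what a certified radius means, a linear classifier admits an exact closed-form ℓ2 certificate: no perturbation with ℓ2 norm smaller than the margin divided by the weight norm can flip the prediction. A minimal sketch (real certification methods for neural networks are far more involved):

```python
import math

def certified_l2_radius(w, b, x):
    """Exact l2 robustness certificate for the linear classifier sign(w.x + b):
    any perturbation delta with ||delta||_2 < radius leaves the prediction unchanged,
    because the distance from x to the decision boundary is |w.x + b| / ||w||_2."""
    margin = abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return margin / norm_w
```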
2.2.2 Testing & Evaluation

Calibrated confidence measures for model predictions
Incorporating calibrated confidence measures alongside a model’s predictions and standard performance metrics, such as accuracy, can help users identify instances of overconfidence in incorrect predictions or underconfidence in correct ones [85]. These additional measures can provide users with more information to better interpret the model’s decisions and assess whether its predictions can be trusted.
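One common way to quantify such miscalibration is the expected calibration error (ECE), which compares average confidence to actual accuracy within confidence bins. A minimal binning-based sketch:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: size-weighted average of |accuracy - mean confidence| over
    equal-width confidence bins. A large gap signals over- or under-confidence."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(1 for _, ok in b if ok) / len(b)
            ece += (len(b) / n) * abs(acc - avg_conf)
    return ece
```

A model that reports 90% confidence but is right only half the time yields a large ECE, flagging the overconfidence the passage describes.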
1.2.9 Other

Incorporating the estimation of atypical input samples or classes for better model reliability
Incorporating the estimation of rare, atypical input samples or classes might improve a model’s reliability, both in its predictions and in its confidence calibration. Model predictions for rare inputs and classes tend to be overconfident and less accurate [232]. For LLMs, the negative log-likelihood can be used as an atypicality measure. For discriminative models, Gaussian Mixture Models can be employed to estimate conditional and marginal distributions, which are then used to measure atypicality.
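A simplified sketch of the atypicality idea, using a single 1-D Gaussian per class rather than a full Gaussian Mixture Model (all names here are illustrative):

```python
import math

def fit_gaussian(samples):
    """Fit a 1-D Gaussian (mean, variance) to feature values of one class."""
    mu = sum(samples) / len(samples)
    var = sum((s - mu) ** 2 for s in samples) / len(samples)
    return mu, max(var, 1e-12)  # floor the variance for numerical safety

def atypicality(x, mu, var):
    """Negative log-likelihood of x under the class Gaussian:
    higher values mean x is more atypical for that class."""
    return 0.5 * math.log(2 * math.pi * var) + (x - mu) ** 2 / (2 * var)
```

Inputs scoring far above the typical NLL range can then be flagged, or their predicted confidences recalibrated, before the prediction is shown to a user.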
1.2.3 Monitoring & Detection

Model development: 2.4 Engineering & Development
Model development > Data-related: 1.1 Model
Model evaluations: 2.2.2 Testing & Evaluation
Model evaluations > General evaluations: 2.2.2 Testing & Evaluation
Model evaluations > Benchmarking: 3.2.1 Benchmarks & Evaluation
Model evaluations > Red teaming: 2.2.2 Testing & Evaluation

Risk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems
Gipiškis, Rokas; San Joaquin, Ayrton; Chin, Ze Shen; Regenfuß, Adrian; Gil, Ariel; Holtman, Koen (2024)
Organizations and governments that develop, deploy, use, and govern AI must coordinate on effective risk mitigation. However, the landscape of AI risk mitigation frameworks is fragmented, uses inconsistent terminology, and has gaps in coverage. This paper introduces a preliminary AI Risk Mitigation Taxonomy to organize AI risk mitigations and provide a common frame of reference. The Taxonomy was developed through a rapid evidence scan of 13 AI risk mitigation frameworks published between 2023 and 2025, which were extracted into a living database of 831 distinct AI risk mitigations.
Build and Use Model
Training, fine-tuning, and integrating the AI model
Developer
Entity that creates, trains, or modifies the AI system
Unable to classify
Could not be classified to a specific AIRM function