Changes to the model's learned parameters, architecture, or training process, including modifications to training data that affect what the model learns.
Fine-tuning modifies learned capabilities post-training through targeted parameter adjustment.
Cost-inducing training of AI models specifically for malicious use
AI models can be designed to make further post-training modifications (e.g., fine-tuning) too costly for malicious uses while preserving normal adaptability for non-malicious uses [88, 56].
1.1.3 Capability Modification

Restrict web access during AI training
Developers can restrict or disable AI systems’ internet access during training. For example, developers can restrict web access to read-only (e.g., by disabling write access via HTTP POST requests and web forms) or limit the AI system’s access to a local network [20].
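As a minimal sketch of the read-only restriction described above (the function and constant names are illustrative, not taken from any particular framework, and the local-network check is a deliberately naive string-prefix test):

```python
from urllib.parse import urlparse

READ_ONLY_METHODS = {"GET", "HEAD"}           # no POST/PUT/DELETE: write access disabled
LOCAL_NETWORKS = ("10.", "192.168.", "127.")  # crude allowlist for local-only access

def is_request_allowed(method: str, url: str, local_only: bool = False) -> bool:
    """Return True only for read-only requests, optionally restricted to a local network."""
    if method.upper() not in READ_ONLY_METHODS:
        return False  # blocks web forms and other write paths
    if local_only:
        host = urlparse(url).hostname or ""
        return host.startswith(LOCAL_NETWORKS)
    return True
```

In practice such a policy would sit in an egress proxy rather than in the training code itself, so the model cannot bypass it.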
1.2.2 Runtime Environment

Adversarial training
Adversarial training [83] is a technique for training AI models in which adversarial inputs are generated for a model, and the model is then trained to give the correct outputs for those adversarial inputs. Adversarial training can involve adversarial examples generated by human experts, human users, or other AI systems.
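The loop described above can be sketched with a toy logistic-regression classifier and the Fast Gradient Sign Method (FGSM) as the adversarial-input generator; everything here (data, hyperparameters, function names) is illustrative:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fgsm_perturb(x, y, w, b, eps):
    """FGSM: shift x by eps in the sign of the input gradient of the
    logistic loss, producing an adversarial version of x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad_x = [(p - y) * wi for wi in w]  # d(loss)/dx for logistic loss
    return [xi + eps * (1.0 if g > 0 else -1.0 if g < 0 else 0.0)
            for xi, g in zip(x, grad_x)]

def adversarial_train(data, eps=0.1, lr=0.5, epochs=200):
    """Train a 2-D logistic-regression model on FGSM-perturbed inputs."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            x_adv = fgsm_perturb(x, y, w, b, eps)  # generate adversarial input
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)) + b)
            # gradient step on the adversarial example, not the clean one
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x_adv)]
            b -= lr * (p - y)
    return w, b
```

Training on the perturbed inputs rather than the clean ones is what distinguishes adversarial training from standard training; in real systems the same structure holds, with FGSM replaced by stronger attacks or human-generated adversarial examples.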
1.1.2 Learning Objectives

Robustness certificates
A model can be certified to withstand adversarial attacks given specific datapoint constraints, model constraints, and attack vectors [156, 124]. Certification means that it can be both analytically proven and empirically shown that the model will withstand such attacks up to a certain threshold. Currently, robustness certification methods are limited to certifying against pixel-manipulation attacks bounded in a specific ℓp norm, canonically the ℓ2 (Euclidean) norm, up to a certain neighborhood radius.
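For intuition about what a certified radius means, a linear classifier admits an exact closed-form ℓ2 certificate: no perturbation with ℓ2 norm smaller than the margin divided by the weight norm can flip the prediction. A minimal sketch (real certification methods for neural networks are far more involved):

```python
import math

def certified_l2_radius(w, b, x):
    """Exact l2 robustness certificate for the linear classifier sign(w.x + b):
    any perturbation delta with ||delta||_2 < radius leaves the prediction unchanged,
    because the distance from x to the decision boundary is |w.x + b| / ||w||_2."""
    margin = abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return margin / norm_w
```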
2.2.2 Testing & Evaluation

Calibrated confidence measures for model predictions
Incorporating calibrated confidence measures alongside a model’s predictions and standard performance metrics, such as accuracy, can help users identify instances of overconfidence in incorrect predictions or underconfidence in correct ones [85]. These additional measures can provide users with more information to better interpret the model’s decisions and assess whether its predictions can be trusted.
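One common way to quantify such miscalibration is the expected calibration error (ECE), which compares average confidence to actual accuracy within confidence bins. A minimal binning-based sketch:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: size-weighted average of |accuracy - mean confidence| over
    equal-width confidence bins. A large gap signals over- or under-confidence."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(1 for _, ok in b if ok) / len(b)
            ece += (len(b) / n) * abs(acc - avg_conf)
    return ece
```

A model that reports 90% confidence but is right only half the time yields a large ECE, flagging the overconfidence the passage describes.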
1.2.9 Other

Incorporating the estimation of atypical input samples or classes for better model reliability
Incorporating the estimation of rare, atypical input samples or classes might improve a model’s reliability, both in its predictions and in its confidence calibration. Model predictions for rare inputs and classes tend to be overconfident and less accurate [232]. For LLMs, the negative log-likelihood can be used as an atypicality measure. For discriminative models, Gaussian Mixture Models can be employed to estimate conditional and marginal distributions, which are then used to measure atypicality.
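A simplified sketch of the atypicality idea, using a single 1-D Gaussian per class rather than a full Gaussian Mixture Model (all names here are illustrative):

```python
import math

def fit_gaussian(samples):
    """Fit a 1-D Gaussian (mean, variance) to feature values of one class."""
    mu = sum(samples) / len(samples)
    var = sum((s - mu) ** 2 for s in samples) / len(samples)
    return mu, max(var, 1e-12)  # floor the variance for numerical safety

def atypicality(x, mu, var):
    """Negative log-likelihood of x under the class Gaussian:
    higher values mean x is more atypical for that class."""
    return 0.5 * math.log(2 * math.pi * var) + (x - mu) ** 2 / (2 * var)
```

Inputs scoring far above the typical NLL range can then be flagged, or their predicted confidences recalibrated, before the prediction is shown to a user.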
1.2.3 Monitoring & Detection

Model development: 2.4 Engineering & Development
Model development > Data-related: 1.1 Model
Model evaluations: 2.2.2 Testing & Evaluation
Model evaluations > General evaluations: 2.2.2 Testing & Evaluation
Model evaluations > Benchmarking: 3.2.1 Benchmarks & Evaluation
Model evaluations > Red teaming: 2.2.2 Testing & Evaluation

Risk Sources and Risk Management Measures in Support of Standards for General-Purpose AI Systems
Gipiškis, Rokas; San Joaquin, Ayrton; Chin, Ze Shen; Regenfuß, Adrian; Gil, Ariel; Holtman, Koen (2024)
Organizations and governments that develop, deploy, use, and govern AI must coordinate on effective risk mitigation. However, the landscape of AI risk mitigation frameworks is fragmented, uses inconsistent terminology, and has gaps in coverage. This paper introduces a preliminary AI Risk Mitigation Taxonomy to organize AI risk mitigations and provide a common frame of reference. The Taxonomy was developed through a rapid evidence scan of 13 AI risk mitigation frameworks published between 2023 and 2025, which were extracted into a living database of 831 distinct AI risk mitigations.
Build and Use Model
Training, fine-tuning, and integrating the AI model
Developer
Entity that creates, trains, or modifies the AI system
Unable to classify
Could not be classified to a specific AIRM function