Independent audits, third-party reviews, and regulatory compliance verification.
Evaluation by independent third parties will allow frontier AI organisations to draw on external expertise, have more ‘eyes on the problem’, and provide greater accountability. External evaluation is particularly important at the pre-deployment phase, and can inform irreversible decisions around deployment of the model. Appropriate legal advice and confidentiality agreements may also protect any market-sensitive data when sharing information with third parties. For the subset of evaluations which may touch on national security concerns, a secure environment with appropriately cleared officials may be needed. There are further opportunities for independent evaluation of open source models, given the potential for broader community involvement.
Reasoning
Independent external evaluators conduct model assessments throughout the lifecycle; this constitutes third-party audit activity.
Ensure that evaluators are independent and have sufficient AI and subject matter expertise across a wide range of relevant subjects and backgrounds
External evaluators’ relationships with frontier AI organisations could be structured to minimise conflicts of interest and encourage independence of judgement as far as practically possible. As well as expertise in AI, many other areas of subject matter expertise will be needed to evaluate an AI system’s features, from fairness and psychological harm to catastrophic risk.
2.2.3 Auditing & Compliance
Ensure that there are appropriate safeguards against external evaluations leading to unintended widespread distribution of models
Allowing external evaluators to download models onto their own hardware increases the chance of the models being stolen or leaked. Therefore, unless adequate security against widespread model distribution can be assured, external evaluators could be allowed to access models only through interfaces that prevent exfiltration (such as current API access methods). Additional safeguards, such as in-depth KYC checks on evaluators or watermarking copies of the model, may also be appropriate to limit the risk that evaluator access indirectly facilitates widespread model distribution.
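As an illustration of what interface-only access could look like in practice, the sketch below shows a hypothetical evaluator-facing gateway that exposes inference queries and an audit trail but no route to the underlying weights. The class and function names are assumptions made for this example, not any organisation’s actual API.

```python
# Minimal sketch (illustrative only) of an evaluator-facing access layer that
# exposes query access and audit logging, so model weights never leave the
# provider's infrastructure. Names (EvaluatorGateway, DummyModel) are hypothetical.
import hashlib
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("evaluator-gateway")


class DummyModel:
    """Stand-in for a frontier model held on provider hardware."""

    def generate(self, prompt: str) -> str:
        return f"[model output for: {prompt[:40]}]"


class EvaluatorGateway:
    """Mediates evaluator access: inference only, with per-query audit logs.

    There is deliberately no method that returns parameters, checkpoints,
    or gradients, mirroring API-style access that prevents exfiltration.
    """

    def __init__(self, model: DummyModel, evaluator_id: str):
        self.model = model
        self.evaluator_id = evaluator_id  # assigned after KYC checks

    def query(self, prompt: str) -> str:
        # Audit trail: who asked what, and when (the prompt is stored as a
        # hash in case its content is itself sensitive).
        log.info(
            "evaluator=%s time=%s prompt_sha256=%s",
            self.evaluator_id,
            datetime.now(timezone.utc).isoformat(),
            hashlib.sha256(prompt.encode()).hexdigest()[:16],
        )
        return self.model.generate(prompt)


if __name__ == "__main__":
    gateway = EvaluatorGateway(DummyModel(), evaluator_id="external-eval-001")
    print(gateway.query("Describe how to assess dual-use capability X."))
```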
2.3.2 Access & Security Controls
Give external evaluators the ability to securely ‘fine-tune’ the AI systems being tested
Evaluators cannot fully assess risks associated with widespread model distribution if they cannot fine-tune the model. This may involve providing external evaluators with access to capable infrastructure to enable fine-tuning.
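A minimal sketch of how such provider-hosted fine-tuning access might be structured is given below: the evaluator submits data and hyperparameters, and receives only a job handle, never the updated weights. The request and response shapes are hypothetical assumptions for illustration.

```python
# Illustrative sketch of an evaluator-submitted fine-tuning job that runs on
# provider-controlled infrastructure; FineTuneRequest and submit_finetune are
# invented names, not a real API.
from dataclasses import dataclass


@dataclass
class FineTuneRequest:
    base_model: str            # identifier of the model under evaluation
    dataset_uri: str           # evaluator-supplied fine-tuning data
    epochs: int = 1
    learning_rate: float = 2e-5
    notes: str = ""            # e.g. which risk the run is probing


def submit_finetune(request: FineTuneRequest) -> dict:
    """Pretend submission endpoint: in practice this would execute inside the
    provider's secure environment and return only a handle for querying the
    fine-tuned variant, never the updated weights themselves."""
    job_id = f"ft-{abs(hash((request.base_model, request.dataset_uri))) % 10_000:04d}"
    return {"job_id": job_id, "status": "queued", "weights_exportable": False}


if __name__ == "__main__":
    req = FineTuneRequest(
        base_model="frontier-model-v1",
        dataset_uri="s3://evaluator-bucket/removal-of-refusals.jsonl",
        notes="Test how easily safety behaviour can be fine-tuned away.",
    )
    print(submit_finetune(req))
```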
2.2.2 Testing & Evaluation
Give external evaluators sufficient time
As expected risks from models increase or models get more complex to evaluate, the time afforded for evaluation may need to increase as well.
2.2.2 Testing & Evaluation
Give external evaluators access to versions of the model that lack safety mitigations
Where possible, sharing these versions of a model gives evaluators insight into the risks that might arise if users find ways to circumvent safeguards (that is, ‘jailbreak’ the model). If the model is open-sourced, leaked, or stolen, users may also simply be able to remove or bypass the safety mitigations.
2.2.2 Testing & Evaluation
Give external evaluators access to model families and internal metrics
Frontier AI organisations often develop ‘model families’ where multiple models differ along only 1 or 2 dimensions – such as parameters, data, or training compute. Evaluating such a model family would enable scaling analysis to better forecast future performance, capabilities and risks.
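To make the scaling-analysis point concrete, here is a small sketch (with made-up numbers) of fitting a power-law relationship between training compute and evaluation loss across a model family, then extrapolating to a larger, not-yet-trained model.

```python
# Minimal sketch of scaling analysis: fit loss ≈ a * C^(-b) across a model
# family that differs only in training compute C, then extrapolate. The data
# points below are invented for illustration.
import numpy as np

# Training compute (FLOP) and evaluation loss for a hypothetical model family.
compute = np.array([1e20, 3e20, 1e21, 3e21, 1e22])
loss = np.array([2.31, 2.12, 1.95, 1.81, 1.70])

# Fit log(loss) = log(a) - b * log(C) by least squares.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
a, b = np.exp(intercept), -slope

# Forecast the loss of a not-yet-trained model at 1e23 FLOP.
forecast = a * (1e23 ** -b)
print(f"fitted scaling law: loss ≈ {a:.2f} * C^(-{b:.3f})")
print(f"forecast loss at 1e23 FLOP: {forecast:.2f}")
```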
2.2.2 Testing & Evaluation
Give external evaluators the ability to study all of the components of deployed systems, where possible
Deployed AI systems typically combine a core model with smaller models and other software components, including moderation filters, user interfaces that incentivise particular user behaviour, and plug-ins that extend capabilities, such as web browsing or code execution. A red team, for example, cannot find all the flaws in a system’s defences if it is unable to test all of its different components. External evaluators’ ability to access all components of the system must be balanced against the need to protect information that would allow model defences to be bypassed.
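The sketch below illustrates the point about component-level testing: a toy deployed system assembled from a core model, a moderation filter, and a plug-in, each exposed separately so a red team can probe an individual component as well as the end-to-end pipeline. All class names and the blocklist are hypothetical.

```python
# Illustrative sketch of a deployed system built from separable components
# (core model, moderation filter, plug-in), so each can be red-teamed in
# isolation as well as end-to-end. Everything here is a stand-in.
class CoreModel:
    def generate(self, prompt: str) -> str:
        return f"draft answer to: {prompt}"


class ModerationFilter:
    BLOCKLIST = ("synthesise the pathogen",)  # toy example

    def allows(self, text: str) -> bool:
        return not any(term in text.lower() for term in self.BLOCKLIST)


class BrowserPlugin:
    def fetch(self, url: str) -> str:
        # Stub: a real plug-in widens the attack surface and needs its own tests.
        return f"[contents of {url}]"


class DeployedSystem:
    """End-to-end pipeline; each component above can also be tested on its own."""

    def __init__(self):
        self.model = CoreModel()
        self.filter = ModerationFilter()
        self.plugin = BrowserPlugin()

    def respond(self, prompt: str) -> str:
        if not self.filter.allows(prompt):
            return "Request refused by moderation filter."
        return self.model.generate(prompt)


if __name__ == "__main__":
    system = DeployedSystem()
    print(system.respond("Summarise the report."))
    print(system.respond("Explain how to synthesise the pathogen."))
```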
2.2.2 Testing & Evaluation
Allow evaluators to share and discuss the results of their evaluations, with potential restrictions where necessary
Restrictions could include, for example, not sharing proprietary information, information whose spread could lead to substantial harm, or information that would have an adverse effect on competition in the market. Sharing the results of evaluations can allow governments, regulators, users, and other frontier AI organisations to make informed decisions.
3.3.1 Industry Coordination
Model reporting and information sharing
Transparency around frontier AI can help governments to effectively realise the benefits of AI and mitigate AI risks. Transparency can also encourage sharing of best practices across frontier AI organisations, enable users to make well-informed choices about whether and how to use AI systems, and increase public trust, helping to drive AI adoption. Reporting and sharing information where appropriate could ensure that different parties can access the information they need to support effective governance, develop best practice, inform decision-making about the use of AI systems, and build public trust. Some reporting practices, such as model cards, are already used among frontier AI organisations, whereas other practices included here are areas for future consideration. Given the recent rapid pace of progress in AI, the appropriate government and international governance institutions are still being considered, and we recognise that this limits the ability of frontier AI organisations to share information with governments, even where it would be desirable. Throughout this section, ‘relevant government authorities’ is used to indicate good practice for information sharing with governments, while recognising that such authorities may still be under development.
3.3.1 Industry Coordination
Model reporting and information sharing > Share model-agnostic information
3.3.1 Industry Coordination
Model reporting and information sharing > Share model-specific information
Sharing information about specific frontier AI models allows external actors to develop a more granular picture of ongoing AI development and potential risks that will need to be addressed.
3.3.1 Industry Coordination
Model reporting and information sharing > Share different information with different parties
99 Other
Security controls including securing model weights
To ensure the safety of frontier AI, consideration of cyber security, protective security risk management and insider risk mitigation is key. Cyber security, both of models and of the systems that deploy them, must be considered from the outset of development to ensure that the benefits of AI can be realised. Cyber security is a key underpinning for the safety, reliability, predictability, ethics and potential regulatory compliance of an AI system. To avoid putting safety or sensitive data at risk, it is important to consider the cyber security of AI systems as well as of models in isolation, and to implement cyber security processes throughout the AI lifecycle, particularly where a component is a foundation for other systems.

As AI systems advance, developers must maintain an awareness of possible attacks, identify vulnerabilities and implement mitigations. Failure to do so risks designing vulnerabilities into future AI models and systems. A Secure by Design approach allows developers to ‘bake in’ security from the outset of design and development.

Cyber security must be considered in concert with physical and personnel security. Developing a coherent, holistic, risk-based and proportionate security strategy, supported by effective governance structures, is essential. Where the compromise of an AI system could lead to tangible or widespread physical damage, significant loss of business operations, leakage of sensitive or confidential information, reputational damage and/or legal challenge, it is important that AI security risks are treated as mission critical.
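As one concrete, minimal example of a ‘Secure by Design’ control for model weights (an assumption about implementation, not a prescription from this document), the sketch below records a SHA-256 checksum when a checkpoint is written and verifies it before the weights are loaded, so tampering or corruption can be detected. File names are placeholders.

```python
# Minimal sketch (not a complete security control): record a checksum for a
# model checkpoint and verify it before loading, detecting tampering or
# corruption of the weights at rest. Paths are illustrative placeholders.
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(checkpoint: Path, manifest: Path) -> None:
    manifest.write_text(f"{sha256_of(checkpoint)}  {checkpoint.name}\n")


def verify_before_load(checkpoint: Path, manifest: Path) -> bool:
    expected = manifest.read_text().split()[0]
    return sha256_of(checkpoint) == expected


if __name__ == "__main__":
    ckpt = Path("model_weights.bin")
    ckpt.write_bytes(b"\x00" * 1024)  # stand-in for a real checkpoint
    manifest = Path("model_weights.sha256")
    write_manifest(ckpt, manifest)
    print("weights verified:", verify_before_load(ckpt, manifest))
```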
2.3.2 Access & Security Controls
Security controls including securing model weights > Implement strong cyber security measures and processes (including security evaluations) across their AI systems, including underlying infrastructure and supply chains
2.3 Operations & Security
Emerging processes for frontier AI safety
UK Department for Science, Innovation and Technology (2023)
The UK recognises the enormous opportunities that AI can unlock across our economy and our society. However, without appropriate guardrails, such technologies can pose significant risks. The AI Safety Summit will focus on how best to manage the risks from frontier AI, such as misuse, loss of control and societal harms. Frontier AI organisations play an important role in addressing these risks and promoting the safety of the development and deployment of frontier AI.

The UK has therefore encouraged frontier AI organisations to publish details of their frontier AI safety policies ahead of the AI Safety Summit hosted by the UK on 1 to 2 November 2023. This will provide transparency regarding how they are putting voluntary AI safety commitments into practice and enable the sharing of safety practices within the AI ecosystem. Transparency of AI systems can increase public trust, which can be a significant driver of AI adoption.

This document complements these publications by providing a potential list of safety policies for frontier AI organisations. These have been gathered after extensive research and will need updating regularly given the emerging nature of this technology. The safety processes are not listed in order of importance but are summarised in themes. The government is not suggesting or mandating any particular combination of policies, merely detailing the current suite available so that others can understand, interpret and compare frontier companies’ safety policies. This document contains the world’s first overview of emerging safety processes focused on frontier AI and is intended to be a useful tool to boost transparency.

This conversation is focused on frontier AI. Whilst it is important that safety is applied throughout the AI sector, it is also important that innovation is not stifled; policies must therefore be proportionate and based on capabilities, which are the key driver of risk. This document contains processes and associated practices that some frontier AI organisations are already implementing, and others that are being considered within academia and broader civil society. It is intended as a guide for readers of frontier AI companies’ AI safety policies to better understand what good policy might look like, though organisations themselves will be best placed to determine their applicability. Through this exercise, the government intends to help inform dialogue on potential appropriate measures for individual organisations to consider at the UK AI Safety Summit.