This page is still being polished. If you have thoughts, please share them via the feedback form.
Data on this page is preliminary and may change. Please do not share or cite these figures publicly.
Training methods that shape model behavior through objectives, feedback, and optimization targets.
Also in Model
State-of-the-art reinforcement learning and fine-tuning techniques, such as Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO), to ensure models do not engage in unsafe behavior.
Multiple experts noted that harmlessness training techniques like RLHF and DPO can be helpful in reducing certain risks, particularly those related to bias, discrimination, and harmful language. However, many also emphasised that these methods are not sufficient on their own and have limitations. Several experts pointed out that harmlessness training can be bypassed or "jailbroken" by determined actors, making it less effective against high-impact risks or threats from malicious users. Several mentioned that it may be more effective against novice threats than sophisticated ones. Some experts raised concerns about defining "unsafe behaviour" or "harm," noting the lack of consensus and potential cultural differences. A couple suggested that oversight of implementation would be necessary. Multiple experts commented that harmlessness training might be more effective for everyday interactions but less so for extreme cases or specific incidents related to critical infrastructure or chemical, biological, radiological, and nuclear (CBRN) risks. Several experts mentioned that these techniques might create a superficial "mask" over the model's behaviour without fundamentally changing its capabilities or knowledge. Some individual points raised included: the need to consider the wellbeing of workers involved in the training process, the potential for these techniques to become outdated, and the importance of balancing multiple objectives rather than focusing on single values.
Reasoning
RLHF and DPO shape model behavior through human feedback optimization targets during training.
Pre-deployment risk assessments
Comprehensive risk assessments before deployment that would assess reasonably foreseeable misuse and include dangerous capability evaluations that incorporate post-training enhancements and collaborations with domain experts. Risk assessments would inform deployment decisions.
2.2.1 Risk AssessmentThird party pre-deployment model audits
External pre-deployment assessment to provide a judgment on the safety of a model. Auditors, which could be governments or independent third parties, would receive access to a fine-tuning API for testing, or further appropriate technical means.
2.2.3 Auditing & ComplianceExternal assessment of testing procedure
Bringing in external AI evaluation firms before deployment to assess and red-team the company's execution of dangerous capabilities evaluations.
2.2.2 Testing & EvaluationVetted researcher access
Giving good faith, public interest evaluation researchers access to black-box research APIs that provide technical and legal safe harbours to limit barriers imposed by usage policy enforcement, logging, and stringent terms of service.
2.3.1 Deployment ManagementAdvanced model access for vetted external researchers
Examples of advanced access rights could include any of the following: increased control over sampling, access to fine-tuning functionality, the ability to inspect and modify model internals, access to training data, or additional features like stable model versions.
2.2.2 Testing & EvaluationData curation
Careful data curation prior to all development stages (including fine-tuning) to filter out high-risk content and ensure the training data is sufficiently high-quality.
1.1.1 Training DataEffective Mitigations for Systemic Risks from General-Purpose AI
Uuk, Risto; Brouwer, Annemieke; Schreier, Tim; Dreksler, Noemi; Pulignano, Valeria; Bommasani, Rishi (2024)
The systemic risks posed by general-purpose AI models are a growing concern, yet the effectiveness of mitigations remains underexplored. Previous research has proposed frameworks for risk mitigation, but has left gaps in our understanding of the perceived effectiveness of measures for mitigating systemic risks. Our study addresses this gap by evaluating how experts perceive different mitigations that aim to reduce the systemic risks of general-purpose AI models. We surveyed 76 experts whose expertise spans AI safety; critical infrastructure; democratic processes; chemical, biological, radiological, and nuclear risks (CBRN); and discrimination and bias. Among 27 mitigations identified through a literature review, we find that a broad range of risk mitigation measures are perceived as effective in reducing various systemic risks and technically feasible by domain experts. In particular, three mitigation measures stand out: safety incident reports and security information sharing, third-party pre-deployment model audits, and pre-deployment risk assessments. These measures show both the highest expert agreement ratings (>60\%) across all four risk areas and are most frequently selected in experts' preferred combinations of measures (>40\%). The surveyed experts highlighted that external scrutiny, proactive evaluation and transparency are key principles for effective mitigation of systemic risks. We provide policy recommendations for implementing the most promising measures, incorporating the qualitative contributions from experts. These insights should inform regulatory frameworks and industry practices for mitigating the systemic risks associated with general-purpose AI.
Build and Use Model
Training, fine-tuning, and integrating the AI model
Developer
Entity that creates, trains, or modifies the AI system
Manage
Prioritising, responding to, and mitigating AI risks