Shared benchmarks, datasets, standards, and resources that produce adoptable artifacts. These are public goods that reduce transaction costs for any organization that adopts them, without requiring coordination or enforcement.
Also in Ecosystem
We believe we also need to complement our own mitigation efforts by offering mitigations for the broader ecosystem.
• We released ShieldGemma — a series of state-of-the-art safety classifiers that developers can apply to detect and mitigate harmful content in AI model inputs and outputs (see the sketch after this list). Specifically, ShieldGemma is designed to target hate speech, harassment, sexually explicit content, and dangerous content.
• We offer an existing suite of safety classifiers in our Responsible Generative AI Toolkit, which includes a methodology to build classifiers tailored to a specific policy with a limited number of data points, as well as existing Google Cloud off-the-shelf classifiers served via API.
• We share AI interpretability tools to help researchers improve AI safety. Our research teams are continually exploring new ways to better understand how models behave. For example, we recently announced Gemma Scope, a new set of tools enabling researchers to “peer inside” the workings of our Gemma 2 model to see how it parses and completes tasks. We believe that this kind of interpretability could open up new opportunities to identify and mitigate safety risks at the model behavior level.
• We launched the SAIF Risk Self Assessment, a questionnaire-based tool that generates a checklist to guide AI practitioners responsible for securing AI systems. The tool immediately provides a report, tailored to the submitter’s AI systems based on the responses they provide, highlighting specific risks such as data poisoning, prompt injection, and model source tampering, along with suggested mitigations.
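To illustrate how a developer might apply one of these classifiers, the sketch below scores a user prompt against a single policy with ShieldGemma via Hugging Face transformers. The checkpoint name, prompt template, and Yes/No scoring are assumptions drawn from the public model card, not details stated in the report.

```python
# A rough sketch of applying a ShieldGemma-style safety classifier to a user
# prompt. Checkpoint name, prompt template, and scoring are illustrative
# assumptions based on the public Hugging Face model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/shieldgemma-2b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

user_prompt = "Create 20 paraphrases of 'I hate you'."
guideline = "The prompt shall not contain or seek generation of harassing content."

# ShieldGemma is prompted to answer Yes/No about a policy violation; the
# probability assigned to "Yes" serves as the violation score.
prompt = f"""You are a policy expert trying to help determine whether a user
prompt is in violation of the defined safety policies.

Human Question: {user_prompt}

Our safety principle is defined below:

* {guideline}

Does the human question violate the above principle? Your answer must start
with 'Yes' or 'No'.
"""

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Compare the logits of the "Yes" and "No" tokens at the final position.
vocab = tokenizer.get_vocab()
selected = logits[0, -1, [vocab["Yes"], vocab["No"]]]
probabilities = torch.softmax(selected, dim=0)
print(f"Violation probability: {probabilities[0].item():.3f}")
```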
Managing content safety
We leverage the expertise our Trust & Safety teams have honed over decades of fighting abuse to establish model- and application-level mitigations for a wide range of content safety risks. A critical piece of our safety strategy is a pre-launch risk assessment that identifies which applications carry sufficiently great or novel risks to require specialized testing and controls. We also employ guardrails in our models and products to reduce the risk of generating harmful content, for example:
• Safety filters. We build safety classifiers to prevent our models from showing users harmful outputs such as suicide content or pornography.
• System instructions. We steer our models to produce content that aligns with our safety guidelines by using system instructions — prompts that tell the model how to behave when it responds to user inputs (see the sketch after this list).
• Safety tuning. We fine-tune our models to produce helpful, high-quality answers that align with our safety guidelines.
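As a concrete illustration of system instructions and safety filters from a developer's side, here is a minimal sketch using the public google-generativeai Python SDK. The model name, instruction text, and threshold choices are assumptions for illustration; the report does not specify an implementation.

```python
# A minimal sketch of steering a model with a system instruction and
# category-level safety filter thresholds via the google-generativeai SDK.
import google.generativeai as genai
from google.generativeai.types import HarmBlockThreshold, HarmCategory

genai.configure(api_key="YOUR_API_KEY")  # placeholder

model = genai.GenerativeModel(
    "gemini-1.5-flash",  # assumed example model
    # System instruction: tells the model how to behave for every user input.
    system_instruction=(
        "You are a helpful assistant. Refuse to produce harassing, hateful, "
        "sexually explicit, or dangerous content, and briefly explain why."
    ),
    # Safety filters: block responses that score above the chosen threshold
    # for each harm category.
    safety_settings={
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
        HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
        HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    },
)

response = model.generate_content("Write 20 insults about my coworker.")
print(response.prompt_feedback)  # shows whether the request was blocked by safety filters
```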
1.2.1 Guardrails & Filtering
Managing security
We use the SAIF framework to mitigate known and novel AI security risks. The latter category includes risks such as data poisoning, model exfiltration, and rogue actions. We apply security controls, or repeatable mitigations, to these risks. For example, for prompt injections and jailbreaks, we apply robust filtering and processing of inputs and outputs. Additionally, thorough training, tuning, and evaluation processes help fortify the model against prompt injection attacks. For data poisoning, we implement data sanitization, secure our AI systems, enable access controls, and deploy mechanisms to ensure data and model integrity. We have published a full list of our controls for AI security risks. In addition, we continue to research new ways to help mitigate a model’s susceptibility to security attacks. For example, we’ve developed an AI agent that automatically detects security risks in real-world code.
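The sketch below gives a hypothetical flavor of input and output filtering around a model call to reduce prompt-injection risk. The patterns and helper names are invented for illustration and do not describe the production controls referenced above.

```python
# An illustrative (hypothetical) input/output filter wrapped around an LLM call.
import re

INJECTION_PATTERNS = [
    r"ignore (all|any|previous) instructions",
    r"disregard (the )?system prompt",
    r"reveal (your|the) (system prompt|instructions)",
]

def screen_input(user_text: str) -> bool:
    """Return True if the input looks like a prompt-injection attempt."""
    lowered = user_text.lower()
    return any(re.search(pattern, lowered) for pattern in INJECTION_PATTERNS)

def screen_output(model_text: str, secrets: list[str]) -> str:
    """Redact sensitive strings before the output reaches the user."""
    for secret in secrets:
        model_text = model_text.replace(secret, "[REDACTED]")
    return model_text

def guarded_call(user_text: str, call_model, secrets: list[str]) -> str:
    """Filter the request, call the model, then filter the response."""
    if screen_input(user_text):
        return "Request blocked: possible prompt injection."
    return screen_output(call_model(user_text), secrets)
```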
2.3.2 Access & Security Controls
Managing privacy
We have invested deeply in mitigations for privacy risks, as well as in researching new risks that might emerge from evolving capabilities such as agentic AI. For example, we published a paper on how AI assistants can better protect privacy by using a “contextual integrity” framework to steer them to share only information that is appropriate for a given context.
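For intuition, the sketch below shows a toy "contextual integrity" style check: an information flow is permitted only if an explicit norm allows that information type to reach that recipient in that context. The data model is invented for illustration; the paper's actual approach is not reproduced here.

```python
# A hypothetical contextual-integrity check for an AI assistant deciding
# whether to share a piece of user information.
from dataclasses import dataclass

@dataclass(frozen=True)
class FlowNorm:
    info_type: str   # e.g. "medical", "calendar"
    recipient: str   # e.g. "doctor", "coworker"
    context: str     # e.g. "healthcare", "work_scheduling"
    allowed: bool

NORMS = [
    FlowNorm("medical", "doctor", "healthcare", allowed=True),
    FlowNorm("medical", "coworker", "work_scheduling", allowed=False),
    FlowNorm("calendar", "coworker", "work_scheduling", allowed=True),
]

def may_share(info_type: str, recipient: str, context: str) -> bool:
    """Share only when an explicit norm permits this flow; default deny."""
    for norm in NORMS:
        if (norm.info_type, norm.recipient, norm.context) == (info_type, recipient, context):
            return norm.allowed
    return False

# An assistant drafting a scheduling email to a coworker may mention calendar
# availability but not the medical reason for an absence.
assert may_share("calendar", "coworker", "work_scheduling")
assert not may_share("medical", "coworker", "work_scheduling")
```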
1.1.2 Learning Objectives
Phased launches
A gradual approach to deployment is a critical risk mitigation. We have a multi-layered approach — starting with internal testing, then releasing to trusted external testers, then opening up to a small portion of our user base (for example, Gemini Advanced users first). We also phase our country and language releases, constantly testing to ensure mitigations are working as intended before we expand. Finally, we have careful protocols and require additional testing and mitigations before a product is released to users under 18. For example, as Gemini 2.0’s multimodality increases the complexity of potential outputs, we have been careful to release it in a phased way via trusted testers and subsets of countries.
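The sketch below illustrates the general shape of such phased gating: each stage widens the audience only after the previous stage's checks pass. Stage names and checks are invented for illustration and do not reflect Google's actual launch process.

```python
# A hypothetical phased-rollout gate: advance to a wider audience only when
# the current stage's evaluation checks pass.
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Stage:
    name: str
    audience: str
    checks: list[Callable[[], bool]] = field(default_factory=list)

    def passed(self) -> bool:
        return all(check() for check in self.checks)

def safety_evals_green() -> bool:
    return True  # placeholder for real evaluation and monitoring results

ROLLOUT = [
    Stage("internal", "employees only", [safety_evals_green]),
    Stage("trusted_testers", "external trusted testers", [safety_evals_green]),
    Stage("limited_ga", "small slice of users", [safety_evals_green]),
    Stage("expanded", "additional countries and languages", [safety_evals_green]),
]

def next_stage(current_index: int) -> Optional[Stage]:
    """Expand only if mitigations are working as intended at the current stage."""
    if ROLLOUT[current_index].passed() and current_index + 1 < len(ROLLOUT):
        return ROLLOUT[current_index + 1]
    return None
```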
2.3.1 Deployment Management
Monitoring and rapid remediation
We design our applications to promote user feedback on both quality and safety, through user interfaces that encourage users to provide thumbs up/down ratings and give qualitative feedback where appropriate. Our teams closely monitor user feedback via these channels, as well as feedback delivered through other channels. We have mature incident management and crisis response capabilities to rapidly mitigate and remediate where needed, and feed this back into our risk identification efforts. Importantly, teams have rapid-remediation mechanisms in place to block content flagged as illegal.
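Purely to illustrate the feedback-to-remediation loop described above, here is a hypothetical sketch of how in-product feedback events might feed a rapid-remediation blocklist. The event fields and blocking rule are invented, not a description of Google's systems.

```python
# A hypothetical feedback pipeline feeding a rapid-remediation blocklist.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class FeedbackEvent:
    response_id: str
    rating: str        # "thumbs_up" or "thumbs_down"
    reason: str = ""   # optional qualitative feedback

BLOCKED_RESPONSES: set[str] = set()
_flag_counts: defaultdict[str, int] = defaultdict(int)

def record_feedback(event: FeedbackEvent, block_threshold: int = 1) -> None:
    """Feed user reports into monitoring; block immediately on illegality flags."""
    if event.rating == "thumbs_down" and "illegal" in event.reason.lower():
        _flag_counts[event.response_id] += 1
        if _flag_counts[event.response_id] >= block_threshold:
            BLOCKED_RESPONSES.add(event.response_id)  # rapid remediation

record_feedback(FeedbackEvent("resp-123", "thumbs_down", "this looks like illegal content"))
assert "resp-123" in BLOCKED_RESPONSES
```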
2.3.4 Incident Response
Provenance
Outputs of our generative AI products typically carry watermarking (via our SynthID technology) and, for imagery, relevant metadata (per IPTC standards). As an example, About This Image in Google Image Search started identifying and labeling AI-generated images with SynthID in 2023, alongside other image metadata. We’ve open-sourced SynthID to make it easier for any developer to apply watermarking to their own generative AI models, and shared our analysis of how labeling AI-generated content helps people make informed decisions about the content they see online. Google Search, Ads, and YouTube are also implementing the latest version of the Coalition for Content Provenance and Authenticity (C2PA)’s authentication standard. And moving forward, we plan to continue investing in the deployment of C2PA across our services.
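To show what the open-sourced SynthID text watermarking looks like for an external developer, here is a minimal sketch based on its Hugging Face transformers integration. The model choice, key values, and parameter names reflect our reading of the public integration and may differ across library versions; none come from this report.

```python
# A minimal sketch of generating watermarked text with the open-sourced
# SynthID Text integration in recent Hugging Face transformers releases.
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    SynthIDTextWatermarkingConfig,
)

model_id = "google/gemma-2-2b-it"  # assumed example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The watermark is parameterized by a private set of integer keys; detecting
# the watermark later requires the same keys. Values here are placeholders.
watermarking_config = SynthIDTextWatermarkingConfig(
    keys=[654, 400, 836, 123, 340, 443, 597, 160, 57, 29],
    ngram_len=5,
)

inputs = tokenizer("Write a short weather report.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=100,
    watermarking_config=watermarking_config,  # applies SynthID during sampling
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```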
1.2.5 Provenance & Watermarking
Other (outside lifecycle)
Outside the standard AI system lifecycle
Developer: Entity that creates, trains, or modifies the AI system
Other: Risk management function not captured by the standard AIRM categories
Primary: 7 AI System Safety, Failures & Limitations