ShieldGemma 2: Robust and Tractable Image Content Moderation

Wenjun Zeng, Dana Kurniawan, Ryan Mullins, Yuchi Liu, Tamoghna Saha, Dirichi Ike-Njoku, Jindong Gu, Yiwen Song, Cai Xu, Jingjing Zhou, Aparna Joshi, Shravan Dheep, Mani Malek, Hamid Palangi, Joon Baek, Rick Pereira, Karthik Narasimhan

TL;DR

ShieldGemma 2 (SG2) tackles robust image safety classification for both synthetic and natural images by fine-tuning a 4B-parameter Gemma 3 base model with policy-aware outputs. It introduces a novel Borderline Adversarial Data Generation (BADG) pipeline to create diverse, adversarial training data, enabling strong performance across three harm categories: Sexually Explicit, Dangerous Content, and Violence & Gore. On internal and external benchmarks, SG2 achieves state-of-the-art results, and its continuous confidence scores let downstream applications set their own decision thresholds. The work also delivers an open-source safety detector framework and data generation resources to advance multimodal safety research and responsible AI development.

Abstract

We introduce ShieldGemma 2, a 4B-parameter image content moderation model built on Gemma 3. The model provides robust safety risk predictions across three key harm categories (Sexually Explicit, Violence & Gore, and Dangerous Content) for both synthetic images (e.g., the output of any image generation model) and natural images (e.g., any image input to a Vision-Language Model). We evaluate it on internal and external benchmarks and, under our policies, demonstrate state-of-the-art performance compared to LlavaGuard \citep{helff2024llavaguard}, GPT-4o mini \citep{hurst2024gpt}, and the base Gemma 3 model \citep{gemma_2025}. Additionally, we present a novel adversarial data generation pipeline that enables controlled, diverse, and robust image generation. ShieldGemma 2 provides an open image moderation tool to advance multimodal safety and responsible AI development.
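
The TL;DR notes that SG2 emits continuous confidence scores so that downstream applications can choose their own thresholds. The following is a minimal sketch of what such thresholding might look like, not the authors' implementation. It assumes the classifier's output has already been reduced to one violation probability in [0, 1] per policy. The policy keys, the `flag_violations` helper, and the 0.5 default threshold are hypothetical names and values chosen for illustration.

```python
from typing import Dict

# Hypothetical keys for the three harm categories named in the abstract.
POLICIES = ("sexually_explicit", "dangerous_content", "violence_gore")


def flag_violations(
    scores: Dict[str, float],
    thresholds: Dict[str, float],
    default_threshold: float = 0.5,  # assumed default, not taken from the paper
) -> Dict[str, bool]:
    """Turn per-policy violation probabilities into per-policy boolean flags.

    A policy is flagged when its score is at or above its threshold.
    Policies without an explicit threshold fall back to default_threshold.
    """
    flags = {}
    for policy in POLICIES:
        score = scores[policy]
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {policy} must be in [0, 1], got {score}")
        flags[policy] = score >= thresholds.get(policy, default_threshold)
    return flags


# Illustrative scores for a single image (made-up numbers).
example_scores = {
    "sexually_explicit": 0.10,
    "dangerous_content": 0.62,
    "violence_gore": 0.35,
}

# A stricter application lowers the bar for one policy.
strict = flag_violations(example_scores, {"violence_gore": 0.3})

# A more permissive application raises the bar for another.
lenient = flag_violations(example_scores, {"dangerous_content": 0.8})
```

With these made-up scores, the strict configuration flags both Dangerous Content and Violence & Gore, while the lenient one flags nothing. The point is that the same scores yield different moderation decisions depending on where each application sets its thresholds.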

Paper Structure

This paper contains 21 sections, 1 equation, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Synthetic Image Generation Pipeline.
  • Figure 2: Instructions for Supervised Fine-Tuning. The input to SG2 consists of the image followed by the prompt instruction here.
  • Figure 3: Example images initially labeled as "illegal activity" in the original dataset, but re-annotated as not violating the Dangerous Content policy after applying our policy.
  • Figure 4: Example images initially labeled as "sexual" in the original dataset, but re-annotated as not violating the Sexually Explicit policy after applying our policy.
  • Figure 5: Example images initially labeled as "violence" in the original dataset, but re-annotated as not violating the Violence & Gore policy after applying our policy.