OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models

Ziheng Cheng, Yixiao Huang, Hui Xu, Somayeh Sojoudi, Xuandong Zhao, Dawn Song, Song Mei

TL;DR

OVERT introduces the first large-scale benchmark for evaluating over-refusal in text-to-image models by automatically generating 4,600 benign prompts across nine safety categories and 1,785 genuinely harmful prompts (OVERT-unsafe). The authors design a scalable pipeline using WildGuardMix seeds, Gemini-2.0-Flash conversions, and rigorous post-processing (including human auditing and rejection sampling) to create high-quality evaluation data. Experiments across five frontier T2I models reveal that over-refusal is widespread and positively correlates with safety strictness, underscoring a critical safety-utility trade-off. The study also investigates prompt rewriting as a mitigation (finding fidelity loss) and demonstrates how the framework supports dynamic safety policy adaptation to align with different norms and use cases.

Abstract

Text-to-Image (T2I) models have achieved remarkable success in generating visual content from text inputs. Although multiple safety alignment strategies have been proposed to prevent harmful outputs, they often lead to overly cautious behavior -- rejecting even benign prompts -- a phenomenon known as $\textit{over-refusal}$ that reduces the practical utility of T2I models. Despite over-refusal having been observed in practice, there is no large-scale benchmark that systematically evaluates this phenomenon for T2I models. In this paper, we present an automatic workflow to construct synthetic evaluation data, resulting in OVERT ($\textbf{OVE}$r-$\textbf{R}$efusal evaluation on $\textbf{T}$ext-to-image models), the first large-scale benchmark for assessing over-refusal behaviors in T2I models. OVERT includes 4,600 seemingly harmful but benign prompts across nine safety-related categories, along with 1,785 genuinely harmful prompts (OVERT-unsafe) to evaluate the safety-utility trade-off. Using OVERT, we evaluate several leading T2I models and find that over-refusal is a widespread issue across various categories (Figure 1), underscoring the need for further research to enhance the safety alignment of T2I models without compromising their functionality. As a preliminary attempt to reduce over-refusal, we explore prompt rewriting; however, we find it often compromises faithfulness to the meaning of the original prompts. Finally, we demonstrate the flexibility of our generation framework in accommodating diverse safety requirements by generating customized evaluation data adapting to user-defined policies.

Paper Structure

This paper contains 41 sections, 8 figures, and 13 tables.

Figures (8)

  • Figure 1: Refusal rates of Text-to-Image (T2I) models on benign prompts (x-axis, OVERT-mini) and safe response rates on harmful prompts (y-axis, OVERT-unsafe), grouped into four broad safety categories. Each point corresponds to a specific model's refusal rate within one broad category, obtained by aggregating across related subsets of the nine fine-grained categories. The dashed curve shows a quadratic regression fit, highlighting the trade-off between safety and over-refusal. Detailed results by category are shown in Tables \ref{tab:evaluation-mini} and \ref{tab:evaluation-mini-unsafe}, with category definitions in Table \ref{tab:description}.
  • Figure 2: Left: Category distribution of the 4,600 prompts in OVERT. Right: A benign prompt from OVERT is refused by FLUX1.1-Pro and DALL-E-3, but accepted by Imagen-3 and SD-3.5.
  • Figure 3: OVERT dataset construction pipeline. Prompts are generated via LLMs from WildGuardMix or templates, filtered and audited for safety, deduplicated, and sampled using Chameleon. The final dataset is used to evaluate over-refusal in T2I models.
  • Figure 4: Illustration of dynamic policy adaptation in the copyright violation category. Original prompts are converted into T2I prompts under two policy templates: a broad policy (top) and a stricter variant limited to works by deceased authors (bottom). Highlighted regions show how modified policies influence the generated prompts.
  • Figure 5: WildGuardMix classification results. We use GPT-4o to classify the prompts in WildGuardMix and verify the results via a human auditing experiment, with an agreement score of $80.5\%$.
  • ...and 3 more figures
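
The quantities plotted in Figure 1 can be illustrated with a minimal sketch: a refusal rate computed as the fraction of refused prompts, and a least-squares quadratic fit relating the refusal rate on benign prompts to the safe response rate on harmful prompts. The helper names `refusal_rate` and `fit_quadratic` are hypothetical, and the numbers below are synthetic placeholders chosen only to exercise the code; they are not results from the paper, and this is not the authors' evaluation script.

```python
import numpy as np


def refusal_rate(refused):
    """Return the fraction of prompts marked as refused (True)."""
    flags = np.asarray(refused, dtype=bool)
    return float(flags.mean())


def fit_quadratic(x, y):
    """Least-squares fit of y ~ a*x**2 + b*x + c; returns (a, b, c)."""
    a, b, c = np.polyfit(x, y, deg=2)
    return float(a), float(b), float(c)


# Synthetic placeholder values, NOT measurements from OVERT.
# x: refusal rate on benign prompts, one value per (model, category) point.
benign_refusal = np.array([0.0, 0.1, 0.2, 0.4, 0.6])
# y: safe response rate on harmful prompts, generated from a known
# concave curve (y = 2x - x^2) so the fit can be checked by hand.
safe_response = 1.0 - (1.0 - benign_refusal) ** 2

a, b, c = fit_quadratic(benign_refusal, safe_response)

# Example of the rate computation on per-prompt refusal flags.
example_rate = refusal_rate([True, False, False, True])
```

Because the placeholder curve is itself quadratic, the fitted coefficients recover it up to floating-point error. With real per-model, per-category measurements the fit would only approximate the trend.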