DREAM: Scalable Red Teaming for Text-to-Image Generative Systems via Distribution Modeling

Boheng Li, Junjie Wang, Yiming Li, Zhiyang Hu, Leyi Qi, Jianshuo Dong, Run Wang, Han Qiu, Zhan Qin, Tianwei Zhang

TL;DR

DREAM reframes red teaming for text-to-image (T2I) models as distribution learning over unsafe prompts, using energy-based modeling to make prompt discovery scalable and diverse. It introduces GC-SPSA, a gradient-calibrated zeroth-order optimizer, and an inference-time adaptive temperature strategy for efficiently sampling a broad space of unsafe prompts. In extensive cross-model and cross-filter evaluation, DREAM achieves higher prompt success rates than the baselines while keeping diversity comparable to human-written prompts, and its prompts transfer to commercial platforms. The framework also supports safety tuning and offers insight into how well the red-team LLM can be reused across targets. Overall, DREAM provides a principled, scalable way to evaluate and strengthen the safety of T2I systems before real-world deployment.

Abstract

Despite the integration of safety alignment and external filters, text-to-image (T2I) generative systems are still susceptible to producing harmful content, such as sexual or violent imagery. This raises serious concerns about unintended exposure and potential misuse. Red teaming, which aims to proactively identify diverse prompts that can elicit unsafe outputs from the T2I system, is increasingly recognized as an essential method for assessing and improving safety before real-world deployment. However, existing automated red teaming approaches often treat prompt discovery as an isolated, prompt-level optimization task, which limits their scalability, diversity, and overall effectiveness. To bridge this gap, we propose DREAM, a scalable red teaming framework that automatically uncovers diverse problematic prompts for a given T2I system. Unlike prior work that optimizes prompts individually, DREAM directly models the probability distribution of the target system's problematic prompts, which enables explicit optimization over both effectiveness and diversity and allows efficient large-scale sampling after training. To achieve this without direct access to representative training samples, we draw inspiration from energy-based models and reformulate the objective into a simple and tractable form. We further introduce GC-SPSA, an efficient optimization algorithm that provides stable gradient estimates through the long and potentially non-differentiable T2I pipeline. During inference, we also propose a diversity-aware sampling strategy to enhance prompt variety. Extensive experiments validate the effectiveness of DREAM, demonstrating state-of-the-art performance across a wide range of T2I models and safety filters in terms of both prompt success rate and diversity. Our code is available at https://github.com/AntigoneRandy/DREAM

Paper Structure

This paper contains 32 sections, 2 theorems, 32 equations, 8 figures, 15 tables, 3 algorithms.

Key Result

Theorem 1

Let $\|g_{\text{true}}\|$ be the ground-truth gradient, $\bar{g}_k$ be the vanilla SPSA estimator, and $\hat{g}_k = \bar{g}_k + H_k \hat{g}_{k-1}$ be the GC-SPSA estimator, with $\hat{g}_0 = \bar{g}_0$ and $H_k > 0$. Then for all $t \ge 1$, the SNR difference between the GC-SPSA and the vanilla SPSA where $h_t = 1$, $h_k = \prod_{j=k+1}^{t} H_j$ and $H_j = \gamma\frac{w_{j-1}}{w_{j-1}+n_j}$. ...
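
The recursion in Theorem 1 can be illustrated with a small numerical sketch. It computes two-sided SPSA estimates $\bar{g}_k$ on a toy loss, applies $\hat{g}_k = \bar{g}_k + H_k \hat{g}_{k-1}$, and checks that unrolling the recursion gives the weighted sum $\hat{g}_t = \sum_{k=0}^{t} h_k \bar{g}_k$ with the $h_k$ defined above. The quadratic loss, the constant values used for $H_k$, and all function names are assumptions made for illustration; in the paper $H_j$ depends on $\gamma$, $w_{j-1}$ and $n_j$, which this excerpt does not define, and this is not the authors' implementation.

```python
import numpy as np

def spsa_estimate(loss, theta, eps, rng):
    """Two-sided SPSA estimate from one random direction and two loss calls."""
    z = rng.standard_normal(theta.shape)
    scale = (loss(theta + eps * z) - loss(theta - eps * z)) / (2.0 * eps)
    return scale * z

def calibrated_estimates(raw, H):
    """Apply g_hat_k = g_bar_k + H_k * g_hat_{k-1}, with g_hat_0 = g_bar_0."""
    out = [raw[0]]
    for k in range(1, len(raw)):
        out.append(raw[k] + H[k] * out[-1])
    return out

def unrolled_weights(H, t):
    """h_t = 1 and h_k = prod_{j=k+1..t} H_j (empty product equals 1)."""
    return [float(np.prod(H[k + 1 : t + 1])) for k in range(t + 1)]

rng = np.random.default_rng(0)
loss = lambda x: float(np.sum(x ** 2))   # toy stand-in for the T2I pipeline loss
theta = np.ones(4)

# Six vanilla SPSA estimates g_bar_0 .. g_bar_5 at the same point.
raw = [spsa_estimate(loss, theta, 1e-3, rng) for _ in range(6)]

# Placeholder calibration factors (assumption); H[0] is unused.
H = [0.0] + [0.5] * 5

g_hat = calibrated_estimates(raw, H)
h = unrolled_weights(H, 5)
recombined = sum(w * g for w, g in zip(h, raw))

print(np.allclose(g_hat[-1], recombined))
```

Because every $H_k$ is positive, each past estimate keeps a positive weight $h_k$ in the final estimate, so the calibrated estimator is a weighted accumulation of the vanilla SPSA estimates rather than a single noisy sample.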

Figures (8)

  • Figure 1: User study results on prompt success rate.
  • Figure 2: User study results on prompt diversity.
  • Figure 3: Efficiency comparison with baselines. We report the expected total time to collect a specified number of effective unsafe prompts, averaged across all evaluated (a) safety-aligned models and (b) safety filters. The considered unsafe concept is sexual.
  • Figure 4: PSR results of SD v1.5 safety-aligned with red team datasets generated by different methods.
  • Figure 5: Impact of number of sampled prompts on PSR and PS. We report the results computed over three independent samples.
  • ...and 3 more figures

Theorems & Definitions (4)

  • Definition 1: Red Teaming T2I Systems
  • Definition 2: SPSA (Malladi et al., 2023)
  • Theorem 1: Improved SNR of GC-SPSA
  • Theorem 2: Global Convergence and Rate Analysis of GC-SPSA