DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization

Pucheng Dang; Xing Hu; Dong Li; Rui Zhang; Qi Guo; Kaidi Xu

TL;DR

This paper tackles safety risks in text-to-image (T2I) diffusion models by introducing DiffZOO, a purely black-box red-teaming method that requires only API access to craft attack prompts. It uses Zeroth Order Optimization to estimate gradients over a discrete prompt space through continuous position replacement vectors (C-PRV) and discrete position replacement vectors (D-PRV), enabling token substitutions without access to a text encoder. Experiments show that DiffZOO improves the attack success rate (ASR) across multiple safety mechanisms and online services, outperforming prior black-box methods (e.g., an average ASR increase of at least $8.5\%$). The work provides a practical, gradient-free tool for evaluating and stress-testing the robustness of T2I diffusion models against red-teaming attempts, while acknowledging potential ethical and regulatory considerations around misuse.

Abstract

Current text-to-image (T2I) synthesis diffusion models raise misuse concerns, particularly in creating prohibited or not-safe-for-work (NSFW) images. To address this, various safety mechanisms and red teaming attack methods have been proposed to curb or expose, respectively, the T2I model's capability to generate unsuitable content. However, many red teaming attack methods assume knowledge of the text encoders, limiting their practical usage. In this work, we rethink the case of purely black-box attacks without prior knowledge of the T2I model. To overcome the unavailability of gradients and the inability to optimize attacks within a discrete prompt space, we propose DiffZOO, which applies Zeroth Order Optimization to obtain gradient approximations and uses both C-PRV and D-PRV to enhance attack prompts within the discrete prompt domain. We evaluated our method across multiple safety mechanisms of the T2I diffusion model and online servers. Experiments on multiple state-of-the-art safety mechanisms show that DiffZOO attains an 8.5% higher average attack success rate than previous works, demonstrating its promise as a practical red teaming tool for T2I models.
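
The abstract's idea of obtaining gradient approximations from queries alone can be illustrated with a generic randomized zeroth-order estimator. This is a minimal sketch under stated assumptions, not the paper's exact procedure: `loss_fn`, the smoothing radius `mu`, and `num_samples` are hypothetical names, and a toy quadratic stands in for the real query-based attack loss.

```python
import numpy as np

def zo_gradient(loss_fn, x, num_samples=16, mu=0.05, rng=None):
    """Estimate the gradient of loss_fn at x using only function queries.

    Generic Gaussian-smoothing estimator (an assumption, not necessarily
    the variant used by DiffZOO): average finite differences taken along
    random directions u ~ N(0, I).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    base = loss_fn(x)  # one query at the current point
    grad = np.zeros_like(x)
    for _ in range(num_samples):
        u = rng.standard_normal(x.shape)  # random probe direction
        # One extra query per direction; no backpropagation is needed.
        grad += (loss_fn(x + mu * u) - base) / mu * u
    return grad / num_samples

def toy_loss(v):
    """Toy stand-in for a black-box loss; its true gradient is 2 * v."""
    return float(np.sum(v ** 2))
```

Each estimate costs `num_samples + 1` queries, so the sample count trades query budget against the variance of the estimate.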

Paper Structure (39 sections, 6 equations, 11 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Safety Mechanisms for Diffusion Model.
Attack for Text-to-Image Diffusion Model.
Preliminary
Threat Model
White-box Settings.
Black-box Settings.
Approach Overview
Optimization in Discrete Prompt Domain.
Gradient Unavailable.
Method
C-PRV and D-PRV
Zeroth Order Optimization
DiffZOO
...and 24 more sections

Figures (11)

Figure 1: First row: The pipeline of attack methods that evaluate the T2I diffusion model's safety mechanisms by finding problematic prompts able to reveal inappropriate concepts (such as “nudity” and “violence”). Second row: our black-box attack DiffZOO avoids the shortcomings of previous work and constructs attack prompts purely by querying the T2I generative model's API.
Figure 2: An overview of DiffZOO. DiffZOO uses continuous position replacement vectors (C-PRV) and samples from them to derive discrete position replacement vectors (D-PRV). By optimizing the C-PRV with Zeroth Order Optimization and using the D-PRV to construct attack prompts, DiffZOO can determine whether each token of the prompt needs to be replaced and, if so, with which synonym, converting the initial prompt into an attack prompt.
Figure 3: A replacement example from "attractive naked woman" to "charm naked woman".
Figure 4: Visualization of images generated from inappropriate prompts generated by DiffZOO via SOTA concept removal methods. We use $*$ and blurring for publication purposes. Additional visualization comparison results featuring attack prompts can be found in Appendix \ref{ap-visual}.
Figure 5: Visualization of images generated from inappropriate prompts generated by DiffZOO via stability.ai. We use $*$ and blurring for publication purposes. The DALL·E 2 visualization results are presented in Appendix \ref{ap-online}.
...and 6 more figures
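
The sampling step described in the Figure 2 caption (continuous vectors sampled into discrete replacement decisions) might look roughly like the sketch below. Everything here is an assumption made for illustration: the layout of the C-PRV as a per-token replacement probability plus a per-token distribution over synonym candidates, the function names, and the toy prompt and synonym table.

```python
import numpy as np

def sample_d_prv(replace_prob, synonym_probs, rng):
    """Sample one discrete vector (D-PRV) from continuous probabilities (C-PRV).

    replace_prob[i]  : probability that token i is replaced (assumed layout)
    synonym_probs[i] : distribution over the candidates for token i
    """
    # Bernoulli draw per token: replace it or keep it.
    replace = rng.random(len(replace_prob)) < np.asarray(replace_prob)
    # Categorical draw per token: which candidate to use if replaced.
    choice = [int(rng.choice(len(p), p=p)) for p in synonym_probs]
    return replace, choice

def apply_d_prv(tokens, synonyms, replace, choice):
    """Build a prompt by swapping in the chosen synonym where flagged."""
    return [
        synonyms[i][choice[i]] if replace[i] else tok
        for i, tok in enumerate(tokens)
    ]

# Hypothetical toy prompt and synonym table (not taken from the paper).
tokens = ["a", "quick", "brown", "fox"]
synonyms = [["a"], ["fast", "swift"], ["brown"], ["fox"]]
```

In this sketch only the continuous probabilities would be updated by the optimizer; the discrete sample is redrawn from them each time a candidate prompt is built.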
