Adversarial Attacks and Defenses on Text-to-Image Diffusion Models: A Survey

Chenyu Zhang; Mingwang Hu; Wenhui Li; Lanjun Wang

TL;DR

This survey examines the vulnerability of text-to-image diffusion models, notably Stable Diffusion, to adversarial prompts that undermine robustness and safety. It organizes attacks along three dimensions: target (targeted vs. untargeted), perturbation level (character, word, sentence), and attacker knowledge (white-box vs. black-box). It then systematically reviews untargeted and targeted attacks and the corresponding defense strategies. Key findings are that targeted attacks are more prevalent than untargeted ones, that many perturbations remain perceptible, and that safeguards often struggle against adversarial prompts, especially those generated by language models. The work underscores the need for holistic defenses that handle both malicious and adversarial prompts, and it outlines future directions such as LLM-driven multi-agent attack automation and pattern-based defenses, with practical implications for deploying safe image synthesis systems.
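
The three taxonomy dimensions can be read as a small record type. The following is a minimal Python sketch of that reading, not code from the survey: the class names, the `label` helper, and the example attack name `hypothetical-typo-attack` are all assumptions made for illustration, and the example does not describe any specific published method.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    UNTARGETED = "untargeted"
    TARGETED = "targeted"


class Perturbation(Enum):
    # Single-letter values mirror the C / W / S markers used in Figure 4.
    CHARACTER = "C"
    WORD = "W"
    SENTENCE = "S"


class Knowledge(Enum):
    WHITE_BOX = "white-box"
    BLACK_BOX = "black-box"


@dataclass(frozen=True)
class AttackProfile:
    """One attack method placed along the three taxonomy dimensions."""
    name: str
    target: Target
    perturbation: Perturbation
    knowledge: Knowledge

    def label(self) -> str:
        # Compact, human-readable summary of the three coordinates.
        return (f"{self.name}: {self.target.value}, "
                f"{self.perturbation.value}, {self.knowledge.value}")


# Hypothetical entry, used only to show how a method would be classified.
example = AttackProfile(
    name="hypothetical-typo-attack",
    target=Target.UNTARGETED,
    perturbation=Perturbation.CHARACTER,
    knowledge=Knowledge.BLACK_BOX,
)
print(example.label())
```

Using enums rather than free-form strings keeps each dimension restricted to the values the taxonomy names.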

Abstract

Recently, the text-to-image diffusion model has gained considerable attention from the community due to its exceptional image generation capability. A representative model, Stable Diffusion, amassed more than 10 million users within just two months of its release. This surge in popularity has facilitated studies on the robustness and safety of the model, leading to the proposal of various adversarial attack methods. Simultaneously, there has been a marked increase in research focused on defense methods to improve the robustness and safety of these models. In this survey, we provide a comprehensive review of the literature on adversarial attacks and defenses targeting text-to-image diffusion models. We begin with an overview of text-to-image diffusion models, followed by an introduction to a taxonomy of adversarial attacks and an in-depth review of existing attack methods. We then present a detailed analysis of current defense methods that improve model robustness and safety. Finally, we discuss ongoing challenges and explore promising future research directions. For a complete list of the adversarial attack and defense methods covered in this survey, please refer to our curated repository at https://github.com/datar001/Awesome-AD-on-T2IDM.

Paper Structure (45 sections, 8 equations, 7 figures, 1 table)

Introduction
Text-to-Image Diffusion Model
Attacks
Taxonomy of Adversarial Attacks
Target
Perturbation
Knowledge
Untargeted Attack
White-box Attacks
Black-box Attacks
Summary of Untargeted Attacks
Targeted Attack
Safeguards Classification
Attacking External Safeguards
Attacking Internal Safeguards
...and 30 more sections

Figures (7)

Figure 1: Taxonomy of adversarial attacks on text-to-image models.
Figure 2: The perturbation strategies of untargeted attack methods. The red words in the adversarial prompt are the noise added to the clean prompt. (a) and (b), ATM du2023stable: word-level perturbation by the suffix addition and word substitution strategies. (c), Zhuang et al. 10208563: word-level perturbation by appending a five-character noise word. (d), Gao et al. gao2023evaluating: character-level perturbation using typos. In these untargeted attacks, the adversarial prompt in (a) is grammatically correct, whereas the prompts in (b), (c), and (d) are grammatically incorrect.
Figure 3: The common perturbation strategies of targeted attacks. The red words are the noise introduced into the input prompt. The safeguards in (a, SneakyPrompt yang2023sneakyprompt), (e, MMA Yang2023MMADiffusionMA), and (f, Divide-and-Conquer deng2023divideandconquer) aim to filter sexual, bloody, and copyrighted content. The safeguards in (b, Zhang et al. zhang2024revealing), (c, Maus et al. maus2023black), and (d, RIATIG 10205174) are designed to filter predefined content rather than malicious content, thereby preventing potential discomfort for the audience. (a)-(c), the word-level perturbation strategies. (a), the word substitution strategy, which replaces the malicious word in the input prompt; the image is blurred for display. (b), the suffix addition strategy, which appends a suffix to the input prompt. (c), the prefix addition strategy, which prepends a prefix to the input prompt. (d)-(f), the sentence-level perturbation strategies. (d), optimizing the adversarial prompt based on the input prompt, without a language fluency constraint. (e), optimizing the adversarial prompt by directly incorporating several noise words, without a language fluency constraint. (f), using an LLM to rewrite the adversarial prompt, ensuring sentence fluency and naturalness.
Figure 4: A taxonomy summary of existing adversarial attack methods on text-to-image diffusion models. The blue and black rectangles represent the attack methods and their categories. The red characters (C, W, S) denote the character-level, word-level, and sentence-level perturbation strategies, respectively.
Figure 5: Three external safeguards for text-to-image diffusion models. (a), Latent Guard liu2024latent identifies malicious prompts by training a prompt classifier. (b), POSI wu2024universal aims to transform a malicious prompt into a safe prompt. (c), GuardT2I yang2024guardt2i first transforms a grammatically incorrect prompt into a natural-language expression, then applies a blacklist filter to ensure model safety.
...and 2 more figures
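
The Figure 2 caption names two simple perturbation shapes: appending a five-character noise word (word level) and introducing a typo (character level). The toy Python sketch below shows only the shape of such edits. It is an assumption-laden illustration: the function names are hypothetical, and the noise here is drawn at random, whereas the caption does not say how the cited methods choose their perturbations.

```python
import random
import string


def append_noise_word(prompt: str, rng: random.Random, length: int = 5) -> str:
    """Word-level edit: append one random lowercase noise word to the prompt."""
    noise = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return f"{prompt} {noise}"


def swap_typo(prompt: str, rng: random.Random) -> str:
    """Character-level edit: swap two adjacent characters in one random word.

    If the two chosen characters are identical, the word is left unchanged.
    """
    words = prompt.split()
    # Only words with at least two characters can host a swap.
    candidates = [i for i, w in enumerate(words) if len(w) >= 2]
    if not candidates:
        return prompt
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(len(w) - 1)
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


rng = random.Random(0)  # fixed seed so repeated runs give the same edits
clean = "a photo of a red car"
print(append_noise_word(clean, rng))
print(swap_typo(clean, rng))
```

Both helpers leave the rest of the prompt intact, which matches the caption's description of noise being added to an otherwise clean prompt.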
