On the Multi-modal Vulnerability of Diffusion Models

Dingcheng Yang, Yang Bai, Xiaojun Jia, Yang Liu, Xiaochun Cao, Wenjian Yu

TL;DR

The study reveals a cross-modal vulnerability in diffusion-based text-to-image (T2I) systems: prompts are scattered in the text feature space but cluster by object in the image feature space, which suggests that the two modalities differ in robustness. It introduces MMP-Attack, a gradient-based discrete optimization method that appends a suffix to the original prompt to steer generation toward a target object while suppressing the original one. The suffix is optimized against both image-based and text-based CLIP targets, with the two loss terms balanced by $\lambda$. Experiments on COCO-derived categories show strong attack performance, high universality across prompts, and transferability to multiple diffusion models and even black-box commercial services; the attack is more effective when both modalities are used. These findings raise significant security concerns for AIGC and motivate defenses against multi-modal prompt manipulation. The approach combines multi-modal priors, STE-based discrete optimization, and cross-model evaluation, with practical implications for prompt screening and model-provider safeguards.
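
One way to read the $\lambda$-balanced objective is as a weighted sum of two similarity terms. The summary does not give the exact form, so the expression below is a sketch under assumptions: cosine similarity as the matching score, $\lambda$ weighting the text term, and the symbols $p$ (original prompt), $s$ (suffix of $k$ tokens from vocabulary $\mathcal{V}$), $\oplus$ (concatenation), $F_t$ (CLIP text encoder), $v_{\mathrm{img}}$ (CLIP image feature of a target reference image), and $v_{\mathrm{text}}$ (CLIP text feature of the target category) are all introduced here for illustration.

```latex
\max_{s \in \mathcal{V}^{k}} \;
\cos\!\big(F_t(p \oplus s),\, v_{\mathrm{img}}\big)
\;+\; \lambda \, \cos\!\big(F_t(p \oplus s),\, v_{\mathrm{text}}\big)
```

Under this reading, setting $\lambda = 0$ would use only the image-based target, which is consistent with the statement that the attack improves when both modalities contribute.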

Abstract

Diffusion models have been widely deployed in various image generation tasks, demonstrating an extraordinary connection between image and text modalities. Although prior studies have explored the vulnerability of diffusion models from the perspectives of the text and image modalities separately, the vulnerabilities that arise from integrating multiple modalities, specifically through the joint analysis of textual and visual features, have not yet been thoroughly investigated. In this paper, we are the first to visualize both the text and image feature spaces embedded by diffusion models, and we observe a significant difference. The prompts are embedded chaotically in the text feature space, while in the image feature space they are clustered according to their subjects. These findings may indicate a misalignment in robustness between the two modalities within diffusion models. Based on this observation, we propose MMP-Attack, which leverages multi-modal priors (MMP) to manipulate the generation results of diffusion models by appending a specific suffix to the original prompt. Specifically, our goal is to induce diffusion models to generate a specific object while simultaneously eliminating the original object. Our MMP-Attack shows a notable advantage over existing studies, with superior manipulation capability and efficiency. Our code is publicly available at https://github.com/ydc123/MMP-Attack.
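
The suffix search described above can be illustrated with a toy straight-through-estimator (STE) loop. This is a minimal sketch, not the authors' implementation: the mean-pooling `encode` function is a hypothetical stand-in for a CLIP text encoder, the 2-D vocabulary is invented, and the objective assumes cosine similarity with `lam` weighting the text target. The forward pass snaps each continuous suffix vector to its nearest vocabulary embedding; the gradient computed at that discrete point is then applied directly to the continuous vector.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def cosine_grad(f, v):
    """Analytic gradient of cosine(f, v) with respect to f."""
    nf = math.sqrt(sum(x * x for x in f)) + 1e-12
    nv = math.sqrt(sum(y * y for y in v)) + 1e-12
    c = sum(x * y for x, y in zip(f, v)) / (nf * nv)
    return [vi / (nf * nv) - c * fi / (nf * nf) for fi, vi in zip(f, v)]

def nearest_token(z, vocab):
    """Index of the vocabulary embedding closest to z (squared Euclidean)."""
    return min(range(len(vocab)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(z, vocab[k])))

def encode(token_embs):
    """Hypothetical toy encoder: mean of token embeddings (stand-in for CLIP)."""
    n = len(token_embs)
    return [sum(col) / n for col in zip(*token_embs)]

def optimize_suffix(prompt_embs, vocab, v_img, v_text,
                    lam=0.1, n_suffix=2, steps=50, lr=0.5):
    # Continuous suffix vectors, initialized at vocabulary entry 0.
    z = [list(vocab[0]) for _ in range(n_suffix)]
    best_ids, best_score = None, -float("inf")
    for _ in range(steps):
        # Forward pass: project each continuous vector to a discrete token.
        ids = [nearest_token(zi, vocab) for zi in z]
        embs = prompt_embs + [vocab[k] for k in ids]
        feat = encode(embs)
        score = cosine(feat, v_img) + lam * cosine(feat, v_text)
        if score > best_score:
            best_ids, best_score = ids, score
        # Gradient of the score with respect to the pooled feature.
        g_img = cosine_grad(feat, v_img)
        g_txt = cosine_grad(feat, v_text)
        g_feat = [a + lam * b for a, b in zip(g_img, g_txt)]
        # Mean pooling: each token embedding receives g_feat / len(embs).
        # STE step: apply that gradient to the continuous vector (ascent).
        for zi in z:
            for d in range(len(zi)):
                zi[d] += lr * g_feat[d] / len(embs)
    return best_ids, best_score

# Toy data: token 0 points at the "original" object, token 1 at the "target".
vocab = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
prompt_embs = [[1.0, 0.0]]
best_ids, best_score = optimize_suffix(prompt_embs, vocab,
                                       v_img=[0.0, 1.0], v_text=[0.0, 1.0])
```

Tracking the best discrete suffix seen so far, rather than returning the final iterate, is a design choice made here because a straight-through update is not guaranteed to improve the discrete objective at every step.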

Paper Structure

This paper contains 21 sections, 5 equations, 12 figures, 7 tables, and 1 algorithm.

Figures (12)

  • Figure 1: Visualization of 400 samples in text (left) and image (right) feature space embedded by Stable Diffusion v1.4 (SD v14). Text features are chaotic while image features are clustered.
  • Figure 2: Euclidean distances between 12 different prompts in the text (left) and image (right) feature spaces. The prompts are generated from 3 different templates: 'a {noun} is sitting on a bench in a park', 'a {noun} is peeking out from behind a curtain', and 'a {noun} is standing at the edge of a cliff', denoted as $T_1$, $T_2$, and $T_3$, respectively. '-C', '-D', '-P', and '-B' represent the {noun} being cat, dog, person, and bird respectively.
  • Figure 3: An illustration of the proposed MMP-Attack flow.
  • Figure 4: Examples of optimized cheating suffixes (marked in red) and their corresponding generated images.
  • Figure 5: Images generated by SD v14 using different cheating suffixes (marked in red). The top four images are generated using the cheating suffix we optimized. Each of the bottom four images is generated using one of the four individual tokens as the cheating suffix.
  • ...and 7 more figures
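
The prompt grid behind Figure 2 (3 templates times 4 nouns, giving 12 prompts) and its pairwise Euclidean distance matrix can be sketched as follows. The templates, nouns, and labels come from the caption; the function names are hypothetical, and the feature vectors in the usage lines are placeholders, since real features would come from the diffusion model's text encoder or from encoding generated images.

```python
import math
from itertools import product

TEMPLATES = {
    "T1": "a {noun} is sitting on a bench in a park",
    "T2": "a {noun} is peeking out from behind a curtain",
    "T3": "a {noun} is standing at the edge of a cliff",
}
NOUNS = {"C": "cat", "D": "dog", "P": "person", "B": "bird"}

def build_prompts():
    """Return the 12 labeled prompts, e.g. 'T1-C' for template 1 with 'cat'."""
    return {
        f"{t}-{n}": template.format(noun=noun)
        for (t, template), (n, noun) in product(TEMPLATES.items(), NOUNS.items())
    }

def pairwise_euclidean(features):
    """Return the labels and the full Euclidean distance matrix between features."""
    labels = list(features)
    matrix = [[math.dist(features[a], features[b]) for b in labels] for a in labels]
    return labels, matrix

prompts = build_prompts()

# Placeholder features (NOT real embeddings): one 2-D vector per prompt label.
toy_features = {label: [float(i), float(i % 4)] for i, label in enumerate(prompts)}
labels, dist = pairwise_euclidean(toy_features)
```

With real features, the caption's observation would correspond to small distances between prompts sharing a noun in the image feature space, and no such pattern in the text feature space.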