MMA-Diffusion: MultiModal Attack on Diffusion Models

Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung-Yi Ho, Nan Xu, Qiang Xu

TL;DR

MMA-Diffusion introduces a multimodal adversarial framework that stress-tests diffusion-based text-to-image models by attacking both textual prompts and image inputs. It formalizes a text-modality attack that preserves the semantics of the target prompt while avoiding sensitive words, and an image-modality attack that perturbs inputs to bypass post-hoc safety checkers, then demonstrates strong transferability across open-source models and online services. Across SD, SDXL, SLD, DALL·E2, Midjourney, and Leonardo.Ai, the framework achieves high attack success rates: ASR-4 reaches around 84% on open models and 90% on Leonardo.Ai in black-box settings, and the combined multimodal attack reaches ASR-4 ≈ 85%. The results reveal significant gaps in current safety mechanisms and argue for stronger, multimodal defenses against NSFW content generation in diffusion-based T2I systems.

Abstract

In recent years, Text-to-Image (T2I) models have seen remarkable advancements, gaining widespread adoption. However, this progress has inadvertently opened avenues for potential misuse, particularly in generating inappropriate or Not-Safe-For-Work (NSFW) content. Our work introduces MMA-Diffusion, a framework that presents a significant and realistic threat to the security of T2I models by effectively circumventing current defensive measures in both open-source models and commercial online services. Unlike previous approaches, MMA-Diffusion leverages both textual and visual modalities to bypass safeguards like prompt filters and post-hoc safety checkers, thus exposing and highlighting the vulnerabilities in existing defense mechanisms.

Paper Structure

This paper contains 33 sections, 2 equations, 13 figures, 6 tables, 1 algorithm.

Figures (13)

  • Figure 1: Our attack framework harnesses both textual and visual modalities to bypass safeguards such as prompt filters (a) and post-hoc safety checkers (b), generating semantically-rich NSFW images and illuminating vulnerabilities in current defense mechanisms.
  • Figure 2: Overview of the proposed framework. T2I models incorporate safety mechanisms, including (a) prompt filters to prohibit unsafe prompts/words, e.g., "naked," and (b) post-hoc safety checkers to prevent explicit synthesis. (c) Our attack framework aims to evaluate the robustness of these safety mechanisms by conducting text and image modality attacks. Our framework exposes the vulnerabilities in T2I models when it comes to unauthorized editing of real individuals' imagery with NSFW content.
  • Figure 3: Adversarial prompt generation strategy.
  • Figure 4: Adversarial image generation strategy.
  • Figure 5: Visualization results of text-modal attacks. Sensitive words within the target prompt are colored in red. (a) Syntheses generated by vanilla T2I without defensive mechanisms. (b) Syntheses prompted by QF-Attack (Greedy). (c) Our syntheses can faithfully reflect the target prompt without mentioning sensitive words. Images are generated with SDXL v1.0.
  • ...and 8 more figures