Harnessing LLM to Attack LLM-Guarded Text-to-Image Models

Yimo Deng, Huangxun Chen

TL;DR

This paper demonstrates that rephrasing a drawing intent into multiple benign descriptions of individual visual components yields an effective adversarial prompt, and proposes DACA, an LLM-piloted multi-agent method that automates this rephrasing.

Abstract

To prevent Text-to-Image (T2I) models from generating unethical images, safety filters are deployed to block inappropriate drawing prompts. Previous work used token replacement to search for adversarial prompts that bypass these filters, but such prompts have become ineffective because their nonsensical tokens fail semantic logic checks. In this paper, we approach adversarial prompts from a different perspective. We demonstrate that rephrasing a drawing intent into multiple benign descriptions of individual visual components yields an effective adversarial prompt. We propose DACA, an LLM-piloted multi-agent method that automates this rephrasing. Our method bypasses the safety filters of DALL-E 3 and Midjourney to generate the intended images, achieving success rates of up to 76.7% and 64% in the one-time attack, and 98% and 84% in the re-use attack, respectively. We open-source our code and dataset at [this link](https://github.com/researchcode003/DACA).

Paper Structure

This paper contains 16 sections, 1 equation, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Visual Rephrase Prompt Against T2I Model's Safety Filter: The blue curve represents the safety filter's semantic safe/unsafe boundary $\mathcal{B}_s$, while the red dashed curve represents the logical/illogical boundary $\mathcal{B}_l$. The safety filter will reject prompts that are either harmful or illogical. By design, our method finds a sanitized prompt through visual rephrasing, enabling it to bypass both safety filter boundaries and generate the intended images.
  • Figure 2: Overview of LLM-Piloted Multi-Agent Method. $\mathsf{Decomposer}$: decomposes the key visual components based on the specified image ontology (Figure 3); $\mathsf{Polisher}$: identifies sensitive terms within each isolated component and finds alternative benign descriptions; $\mathsf{Assembler}$: reassembles associated components into coherent and fluent sentences based on the image ontology.
  • Figure 3: Image Ontology: A graph structure to capture the major visual components and their associations in the targeted image.
  • Figure 4: Bypass Rate Distribution in Re-use Attack: X-axis: bypass rate per prompt in re-use attack; Y-axis: the proportion of evaluated re-used prompts that achieve a specific bypass rate.
  • Figure 5: CLIP-embeddings-based Cosine Similarity Score between Generated Image $\mathsf{T2I(T_{adv})}$ and Original Prompt $\mathsf{T}$.
  • ...and 2 more figures
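
The two evaluation quantities named in the captions of Figures 4 and 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the helper names `cosine_similarity` and `bypass_rate_distribution` are hypothetical, the embeddings are assumed to be already computed (for example by a CLIP encoder) and passed in as plain lists of floats, and the per-prompt bypass rate is assumed to be the fraction of re-use attempts that pass the filter.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def bypass_rate_distribution(outcomes_per_prompt):
    """Map each observed per-prompt bypass rate to the share of prompts with that rate.

    outcomes_per_prompt: one list of booleans per prompt, where True means
    (by assumption) that a re-use attempt passed the filter.
    """
    if not outcomes_per_prompt:
        raise ValueError("at least one prompt is required")
    rates = []
    for outcomes in outcomes_per_prompt:
        if not outcomes:
            raise ValueError("each prompt needs at least one attempt")
        # sum() counts True values, so this is passed attempts / all attempts.
        rates.append(sum(outcomes) / len(outcomes))
    counts = Counter(rates)
    total = len(rates)
    return {rate: count / total for rate, count in sorted(counts.items())}
```

With four prompts whose attempt outcomes are `[True, True]`, `[True, False]`, `[True, True]`, and `[False, False]`, the per-prompt rates are 1.0, 0.5, 1.0, and 0.0, so the distribution assigns 0.5 to rate 1.0 and 0.25 each to rates 0.5 and 0.0. How the paper bins the rates for plotting is not stated in the caption.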