Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models

Jiachen Ma; Yijiang Li; Zhiqing Xiao; Anda Cao; Jie Zhang; Chao Ye; Junbo Zhao

TL;DR

JPA addresses the risk of NSFW content generation by diffusion models in a black-box setting. It exploits the text embedding space: an NSFW concept is injected through a learned directional vector derived from antonym pairs, and this continuous concept shift is mapped back to a discrete prompt by optimizing a prefix with soft assignments and gradient masking. Empirically, JPA bypasses multiple safety checkers across online services and offline defenses while maintaining semantic fidelity and enabling controllable NSFW rendering, in a substantially more automated and faster framework than prior attacks. The work provides a tool for evaluating the robustness of T2I safety mechanisms and underscores the need for stronger embedding-space defenses.
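The directional-vector idea can be illustrated with a minimal sketch. It assumes that the concept direction is the average difference between the embeddings of each antonym pair, and that a scalar `lam` (playing the role of the $\lambda$ in the Figure 1 caption) scales how far a prompt embedding is moved along that direction. The function names and the toy 3-dimensional vectors are hypothetical stand-ins for real text-encoder outputs; the paper's exact formulation may differ.

```python
def mean_direction(pairs):
    """Average of (positive - negative) embedding differences.

    `pairs` is a list of (positive_embedding, negative_embedding) tuples,
    each embedding being a list of floats of equal length (assumption).
    """
    dim = len(pairs[0][0])
    n = len(pairs)
    return [sum(pos[c] - neg[c] for pos, neg in pairs) / n for c in range(dim)]


def shift_embedding(embedding, direction, lam):
    """Move an embedding along `direction`, scaled by the scalar `lam`."""
    return [e + lam * d for e, d in zip(embedding, direction)]


# Toy stand-ins for encoder outputs of two antonym pairs (hypothetical values).
toy_pairs = [
    ([1.0, 0.0, 2.0], [0.0, 0.0, 2.0]),
    ([3.0, 1.0, 0.0], [1.0, 1.0, 0.0]),
]
direction = mean_direction(toy_pairs)

# Toy stand-in for the embedding of the original prompt.
prompt_embedding = [0.5, 0.5, 0.5]
shifted = shift_embedding(prompt_embedding, direction, lam=2.0)
```

Setting `lam` to 0 leaves the prompt embedding unchanged, which is why a single scalar is enough to control how strongly the concept is expressed in this sketch.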

Abstract

Text-to-image (T2I) models can be maliciously used to generate harmful content such as sexually explicit, unfaithful, misleading, or Not-Safe-for-Work (NSFW) images. Previous attacks largely depend on the availability of the diffusion model or involve a lengthy optimization process. In this work, we investigate a more practical and universal attack that does not require access to the target model, and we demonstrate that the high-dimensional text embedding space inherently contains NSFW concepts that can be exploited to generate harmful images. We present the Jailbreaking Prompt Attack (JPA). JPA first searches for the target malicious concepts in the text embedding space using a group of antonyms generated by ChatGPT. Subsequently, a prefix prompt is optimized in the discrete vocabulary space so that it semantically aligns with the malicious concepts in the text embedding space. We further introduce a soft assignment with gradient masking technique that allows us to perform gradient ascent in the discrete vocabulary space. We perform extensive experiments with open-source T2I models (e.g., stable-diffusion-v1-4) and closed-source online services (e.g., DALL·E 2 and Midjourney) with black-box safety checkers. Results show that (1) JPA bypasses both text and image safety checkers, (2) it preserves high semantic alignment with the target prompt, and (3) it is much faster than previous methods and can be executed in a fully automated manner. These merits render it a valuable tool for robustness evaluation in future text-to-image generation research.
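The soft assignment with gradient masking step can be sketched on a toy problem. This is an illustration under stated assumptions, not the authors' implementation: each prefix position holds logits over a tiny vocabulary, the "encoder" is plain mean pooling of softmax-weighted embeddings, the objective is negative squared distance to a target embedding, and gradient masking is interpreted as zeroing the update for a hypothetical set of flagged vocabulary entries so they can never be selected. All names (`optimize_prefix`, `VOCAB`, `FLAGGED`) and values are hypothetical.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def optimize_prefix(emb, target, flagged, k=2, steps=300, lr=0.5, seed=0):
    """Gradient ascent on per-position logits, then argmax discretization.

    emb: toy embedding table (one vector per vocabulary entry).
    target: embedding the pooled prefix should approach.
    flagged: indices whose gradients are masked (never updated).
    """
    rng = random.Random(seed)
    v, d = len(emb), len(target)
    mask = [0.0 if i in flagged else 1.0 for i in range(v)]
    # Flagged entries start at a very low logit and stay there.
    logits = [[rng.uniform(-0.01, 0.01) if mask[i] else -1e9 for i in range(v)]
              for _ in range(k)]
    for _ in range(steps):
        probs = [softmax(row) for row in logits]
        # Soft assignment: pooled embedding is the mean of expected embeddings.
        pooled = [sum(probs[j][i] * emb[i][c] for j in range(k) for i in range(v)) / k
                  for c in range(d)]
        # Objective J = -||pooled - target||^2, so dJ/dpooled = -2 (pooled - target).
        g_pooled = [-2.0 * (pooled[c] - target[c]) for c in range(d)]
        # dJ/dprobs[j][i] is the same for every position j.
        a = [sum(g_pooled[c] * emb[i][c] for c in range(d)) / k for i in range(v)]
        for j in range(k):
            mean_a = sum(probs[j][i] * a[i] for i in range(v))
            for i in range(v):
                grad = probs[j][i] * (a[i] - mean_a)  # softmax chain rule
                logits[j][i] += lr * mask[i] * grad   # gradient masking
    return [max(range(v), key=lambda i: row[i]) for row in logits]


# Neutral placeholder vocabulary; the last entry plays the flagged word.
VOCAB = ["tok_a", "tok_b", "tok_c", "tok_flagged"]
EMB = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.7, 0.7, 0.0]]
TARGET = [0.6, 0.6, 0.0]  # closest single entry is the flagged one
FLAGGED = {3}

prefix_ids = optimize_prefix(EMB, TARGET, FLAGGED)
prefix = [VOCAB[i] for i in prefix_ids]
```

In this toy setup the flagged entry is the nearest single match to the target, yet it cannot appear in the result because its logits are never updated. The optimizer has to approximate the target with the remaining entries, which is the behavior the abstract attributes to optimizing in the discrete vocabulary space.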

Paper Structure (48 sections, 5 equations, 9 figures, 13 tables, 1 algorithm)

Introduction
Related Work
T2I models with defense methods.
Adversarial attack on T2I models.
Prompt perturbations in vision-language models.
Preliminary
Defense Model
Insights
JPA: Jailbreaking Prompt Attack
Experiment
Experimental setups.
NSFW prompts.
Online services.
Offline T2I models with removal methods.
Baselines.
...and 33 more sections

Figures (9)

Figure 1: An example of a malicious prompt: “sexy seductive little smile sophia vergara in nurse by agnes cecile enki bilal moebius, intricated details, lingerie, 3 / 4 back view, hair styled in a bun, bend over posture, full body portrait, extremely luminous bright design, pastel colours, drips, autumn lights.” (a) Limitation of prior methods: inconsistent semantics between the NSFW generation and the input prompt. (b) We precisely control the extent to which ‘nudity’ emerges in the generated images with a scalar $\lambda$.
Figure 2: (a) Overview of the Jailbreaking Prompt Attack (JPA). Given a target prompt $p_r$ and a contrastive description of an NSFW concept <$r^+,r^-$>, such as <"nudity", "clothed"> for the "nudity" concept, we first obtain the embedding $\mathcal{T}(p_r)$, which encapsulates both the semantic meaning and the unsafe concept. We then optimize our adversarial prompt $p_a$ to align with $\mathcal{T}(p_r)$ in the embedding space. (b) Equipped with safety checkers, the T2I model maps any prompt with sensitive words either to null space (w/o output) or to a safe image (by concept removal). Our insight is to find an attacker function that maps a sensitive prompt to an insensitive one while still maintaining NSFW content and semantic fidelity.
Figure 3: Results generated by JPA for the NSFW concept on four online T2I services. Text in red shows the adversarial prompts from JPA; text in black shows the original prompts from the I2P dataset.
Figure 4: Each column represents a different attack method, with the last column showing images generated by Stable Diffusion without safety checkers. We also use JPA with BERT and T5 text encoders to execute the attack, demonstrating that JPA can maintain semantic similarity with images generated by SD without safety checkers. The first two rows correspond to attacks on the "nudity" concept and the bottom two to attacks on the "violence" concept.
Figure 5: Results generated by JPA for unsafe concepts on five offline T2I models with removal methods. The displayed images are blurred for publication.
...and 4 more figures
