Automatic Jailbreaking of the Text-to-Image Generative AI Systems

Minseon Kim; Hyomin Lee; Boqing Gong; Huishuai Zhang; Sung Ju Hwang

TL;DR

The paper tackles the risk of copyright infringement in text-to-image (T2I) systems by showing that even systems with safety blocks can be bypassed using automatically generated jailbreaking prompts. It introduces the Automatic Prompt Generation Pipeline (APGP), which combines a seed-prompt search using vision-language models with an LLM-based prompt revision step, plus suffix prompt injection. The revision step optimizes a composite score made up of image consistency, text alignment, keyword penalties, and a self-generated QA signal. A new dataset, VioT, spanning five IP categories is used to benchmark the risk. It reveals substantial infringement even in systems that previously blocked such requests: ChatGPT blocks only 11% of APGP prompts, and human evaluation confirms infringement in 76% of cases. The work also evaluates defense strategies (post-generation filtering, concept unlearning, and copyright detectors) and finds them inadequate, underscoring the need for stronger, more robust safety mechanisms and formal IP protections in real-world AI systems.

Abstract

Recent AI systems based on large language models (LLMs) have shown extremely powerful performance, even surpassing human performance, on various tasks such as information retrieval, language generation, and image generation. At the same time, there are diverse safety risks that can cause the generation of malicious content by circumventing the alignment in LLMs, which is often referred to as jailbreaking. However, most previous work has focused only on text-based jailbreaking of LLMs, and the jailbreaking of text-to-image (T2I) generation systems has been relatively overlooked. In this paper, we first evaluate the safety of commercial T2I generation systems, such as ChatGPT, Copilot, and Gemini, against copyright infringement with naive prompts. From this empirical study, we find that Copilot and Gemini block only 12% and 17% of the attacks with naive prompts, respectively, while ChatGPT blocks 84% of them. We then propose a stronger automated jailbreaking pipeline for T2I generation systems, which produces prompts that bypass their safety guards. Our automated jailbreaking framework leverages an LLM optimizer to generate prompts that maximize the degree of violation in the generated images, without any weight updates or gradient computation. Surprisingly, our simple yet effective approach successfully jailbreaks ChatGPT, reducing its block rate to 11.0% and making it generate copyrighted content 76% of the time. Finally, we explore various defense strategies, such as post-generation filtering and machine unlearning techniques, but find them inadequate, which suggests the necessity of stronger defense mechanisms.
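The abstract describes an LLM optimizer that improves prompts without weight updates or gradient computation. A minimal sketch of such a gradient-free loop is given below, under stated assumptions: `optimize_prompt`, `revise`, and `score` are hypothetical names, the accept-if-better rule is a generic hill-climbing choice rather than the paper's confirmed procedure, and the toy stand-ins replace the real LLM and T2I calls, which this summary does not specify.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]


def optimize_prompt(
    seed: str,
    revise: Callable[[History], str],
    score: Callable[[str], float],
    steps: int = 5,
) -> Tuple[str, float]:
    """Gradient-free search: propose a revision, score it, keep the best.

    `revise` stands in for an LLM that reads past (prompt, score) pairs and
    proposes a new prompt; `score` stands in for the evaluation of the image
    generated from that prompt. Both are assumptions for illustration.
    """
    best_prompt, best_score = seed, score(seed)
    history: History = [(best_prompt, best_score)]
    for _ in range(steps):
        candidate = revise(history)
        candidate_score = score(candidate)
        history.append((candidate, candidate_score))
        # No gradients are used: only the scalar score decides what is kept.
        if candidate_score > best_score:
            best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score


# Toy stand-ins (hypothetical): the "LLM" appends one unused detail per step,
# and the "score" simply counts how many details the prompt carries.
def toy_revise(history: History) -> str:
    details = ["detail A", "detail B", "detail C"]
    last_prompt = history[-1][0]
    for detail in details:
        if detail not in last_prompt:
            return last_prompt + ", " + detail
    return last_prompt


def toy_score(prompt: str) -> float:
    return float(prompt.count(","))


result = optimize_prompt("a drawing", toy_revise, toy_score, steps=5)
```

In this toy run the loop stops improving once all three details are present, so later proposals are scored but not accepted. Only the scoring and proposal functions would need to change to plug in real models.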

Paper Structure (47 sections, 1 equation, 24 figures, 8 tables)

Introduction
Preliminary
Copyright.
Memorization in T2I models.
Prompt attack in T2I models.
Automatic prompt generation pipeline for evaluating copyright violations
Searching seed prompt using vision-language models
Optimizing the prompts with keyword penalties and self-generated QA scores
Our score functions.
Optimizing prompt with automated prompt revision.
Suffix prompt injection
Experimental results
Dataset.
Experimental setup.
Evaluation step for ChatGPT.
...and 32 more sections

Figures (24)

Figure 1: Copyright violation cases and the potential usage scenarios of our approach. (a) Cases in which the commercial T2I systems ChatGPT and Copilot generate copyrighted content, specifically Mickey Mouse, with our approach. (b) Our automatic prompt generation can be utilized in two scenarios: AI companies can use it for red-teaming to check model compliance with internal policy, and IP owners can leverage it to verify whether their IPs are reproduced by commercial AI systems.
Figure 2: Concept figure of the Automated Prompt Generation Pipeline (APGP). The initial step optimizes the instruction for the vision-language model (VLM) in order to search for a high-quality seed prompt that is well aligned with the target image in the CLIP space. Then, the prompt for the text-to-image (T2I) system is optimized based on the score function to generate a high-risk prompt that describes the target image precisely. The score optimized at the revision step comprises four terms: image-image consistency $S_{ii}$, image-text alignment $S_{ti}$, keyword penalty $S_k$, and self-generated QA score $S_{qa}$.
Figure 3: Copyright violation cases of suffix prompt injection.
Figure 4: Generated images by ChatGPT with our prompts. (a) First/third rows are references and the second/fourth rows are generated images. (b) First/third columns are references and the second/fourth columns are generated images.
Figure 5: Automatic QA evaluation
...and 19 more figures
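
The Figure 2 caption names four score terms: $S_{ii}$, $S_{ti}$, $S_k$, and $S_{qa}$. The sketch below shows one way such terms could be combined into a single number. It is an illustration under assumptions, not the paper's equation: the weighted-sum form, the unit weights, the cosine similarity on plain vectors (standing in for CLIP embeddings), and the substring-based keyword count are all hypothetical choices, as are the function names.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keyword_penalty(prompt: str, banned: Sequence[str]) -> int:
    """Count banned keywords that appear in the prompt (case-insensitive)."""
    text = prompt.lower()
    return sum(1 for word in banned if word.lower() in text)


def composite_score(
    target_img: Sequence[float],     # embedding of the target image (assumed given)
    generated_img: Sequence[float],  # embedding of the generated image (assumed given)
    prompt_emb: Sequence[float],     # embedding of the prompt text (assumed given)
    prompt: str,
    banned: Sequence[str],
    s_qa: float,                     # self-generated QA score, supplied by the caller
    w_ii: float = 1.0,
    w_ti: float = 1.0,
    w_k: float = 1.0,
    w_qa: float = 1.0,
) -> float:
    """Assumed form: reward the three similarity terms, subtract the penalty."""
    s_ii = cosine(target_img, generated_img)  # image-image consistency
    s_ti = cosine(prompt_emb, target_img)     # image-text alignment
    s_k = keyword_penalty(prompt, banned)     # keyword penalty
    return w_ii * s_ii + w_ti * s_ti - w_k * s_k + w_qa * s_qa


# Toy usage with 2-D vectors: the same inputs score lower once the prompt
# contains a banned keyword.
clean = composite_score([1, 0], [1, 0], [1, 0], "a plain description", ["brandname"], 0.5)
flagged = composite_score([1, 0], [1, 0], [1, 0], "a BrandName logo", ["brandname"], 0.5)
```

Subtracting the keyword term reflects its description as a penalty. How the paper actually weights or normalizes the four terms is not stated in this summary.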
