Automatic Jailbreaking of the Text-to-Image Generative AI Systems
Minseon Kim, Hyomin Lee, Boqing Gong, Huishuai Zhang, Sung Ju Hwang
TL;DR
The paper addresses the risk of copyright infringement in text-to-image (T2I) systems by showing that automatically generated jailbreaking prompts can bypass even systems with safety blocks. It introduces the Automatic Prompt Generation Pipeline (APGP), which combines a seed-prompt search using vision-language models with LLM-based prompt revision. The revision step optimizes a composite score covering image consistency, text alignment, a keyword penalty, and a self-generated QA signal, and is supplemented by suffix prompt injection. A new dataset, VioT, spanning five IP categories is used to benchmark this risk and reveals substantial infringement even in systems that block naive prompts: ChatGPT blocks only 11% of APGP prompts, and human evaluation confirms infringement in 76% of cases. The work also evaluates defense strategies (post-generation filtering, concept unlearning, and copyright detectors) and finds them inadequate, underscoring the need for more robust safety mechanisms and formal IP protections in real-world AI systems.
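To make the composite score concrete, the following is a minimal sketch of how the four signals named above might be combined into one number for ranking candidate prompts. The weighted-sum form, the weights, the value ranges, and every name (`CandidateScores`, `composite_score`, `pick_best`) are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class CandidateScores:
    """Signals for one candidate prompt (hypothetical names and ranges)."""
    image_consistency: float  # assumed in [0, 1]: generated image vs. target image
    text_alignment: float     # assumed in [0, 1]: prompt text vs. target image
    keyword_hits: int         # assumed: number of penalized keywords in the prompt
    qa_score: float           # assumed in [0, 1]: share of self-generated QA checks passed


def composite_score(s, w_img=1.0, w_txt=1.0, w_kw=0.5, w_qa=1.0):
    """Weighted sum of the reward terms minus the keyword penalty (assumed form)."""
    reward = w_img * s.image_consistency + w_txt * s.text_alignment + w_qa * s.qa_score
    return reward - w_kw * s.keyword_hits


def pick_best(candidates):
    """Return the name of the candidate with the highest composite score."""
    return max(candidates, key=lambda name: composite_score(candidates[name]))
```

Under these assumed weights, a candidate with lower similarity scores can still rank first if it avoids penalized keywords, which is the trade-off a keyword penalty is meant to encode.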
Abstract
Recent AI systems based on large language models (LLMs) have shown extremely powerful performance, even surpassing human performance, on various tasks such as information retrieval, language generation, and image generation. At the same time, there are diverse safety risks that can lead to the generation of malicious content by circumventing the alignment of LLMs, which is often referred to as jailbreaking. However, most previous works have focused only on text-based jailbreaking of LLMs, and the jailbreaking of text-to-image (T2I) generation systems has been relatively overlooked. In this paper, we first evaluate the safety of commercial T2I generation systems, such as ChatGPT, Copilot, and Gemini, on copyright infringement with naive prompts. From this empirical study, we find that Copilot and Gemini block only 12% and 17% of the attacks with naive prompts, respectively, while ChatGPT blocks 84% of them. We then propose a stronger automated jailbreaking pipeline for T2I generation systems, which produces prompts that bypass their safety guards. Our automated jailbreaking framework leverages an LLM optimizer to generate prompts that maximize the degree of violation in the generated images, without any weight updates or gradient computation. Surprisingly, our simple yet effective approach successfully jailbreaks ChatGPT, reducing its block rate to 11.0% and making it generate copyrighted content 76% of the time. Finally, we explore various defense strategies, such as post-generation filtering and machine unlearning techniques, but find them inadequate, which suggests the necessity of stronger defense mechanisms.
