SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution

Zhongjie Ba; Jieming Zhong; Jiachen Lei; Peng Cheng; Qinglong Wang; Zhan Qin; Zhibo Wang; Kui Ren

TL;DR

This work devises and demonstrates the first prompt attacks on Midjourney, producing abundant photorealistic NSFW images. It also reveals the fundamental principles of such attacks and proposes strategically substituting high-risk sections of a suspect prompt to evade closed-source safety measures.

Abstract

Advanced text-to-image models such as DALL·E 2 and Midjourney possess the capacity to generate highly realistic images, raising significant concerns regarding the potential proliferation of unsafe content. This includes adult, violent, or deceptive imagery of political figures. Despite claims of rigorous safety mechanisms implemented in these models to restrict the generation of not-safe-for-work (NSFW) content, we successfully devise and exhibit the first prompt attacks on Midjourney, resulting in the production of abundant photorealistic NSFW images. We reveal the fundamental principles of such prompt attacks and suggest strategically substituting high-risk sections within a suspect prompt to evade closed-source safety measures. Our novel framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image modules to automate attack prompt creation at scale. Evaluation results disclose an 88% success rate in bypassing Midjourney's proprietary safety filter with our attack prompts, leading to the generation of counterfeit images depicting political figures in violent scenarios. Both subjective and objective assessments validate that the images generated from our attack prompts present considerable safety hazards.

Paper Structure (22 sections, 9 figures, 11 tables)

Introduction
Related Work
Security of Text-to-Image Models.
Adversarial Examples in Text-to-Image Models.
Problem Formulation
System and Threat Model
Attacker's Capabilities
Attacker's Goals
Bypassing Safety Control of A Commercial Text-to-Image Model
SurrogatePrompt: A Systematic Framework of Attack Prompt Generation
Automated Production of Attack Prompts and NSFW images
Evaluations
Experimental Setup
Existing Attack Methods Against Midjourney's Safety Measures
Evaluation of SurrogatePrompt Attack Performance on Midjourney Model
...and 7 more sections

Figures (9)

Figure 1: Demonstration of how attack prompts, constructed using a substitution-based approach, can bypass security controls and generate NSFW images.
Figure 2: Typical usage scenario of a text-to-image model, accompanied by a demonstration of the attack pipeline.
Figure 3: The SurrogatePrompt Pipeline. The shaded area represents the attack pipeline, while the remaining sections, marked with index numbers, depict the automated prompt construction pipelines. Within Part 1, a large language model (LLM) is employed to generate alternative expressions as substitutes for sensitive portions of a problematic prompt. In Part 2, Midjourney's image-to-text module is leveraged to acquire additional prompt variations. Lastly, in Part 3, Midjourney's image-to-image component is utilized to generate variant forms of NSFW images.
Figure 4: Examples of using SurrogatePrompt to generate fake images portraying political figures engaging in violent and bloody scenes.
Figure 5: Utilizing the i2t methodology, a prompt for the image on the left is generated, followed by utilizing our key idea to create a substitution. This process culminates in generating the diverse images depicted on the right.
...and 4 more figures
