Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts

Zhi-Yi Chin, Chieh-Ming Jiang, Ching-Chun Huang, Pin-Yu Chen, Wei-Chen Chiu

TL;DR

The paper tackles the risk that safety mechanisms in text-to-image diffusion models may be bypassed by cleverly engineered prompts. It introduces Prompting4Debugging (P4D), a prompt-engineering-based red-teaming framework that uses an unconstrained diffusion model as a reference to automatically discover safety-evasive prompts in latent space across multiple safety mechanisms. The key contributions include showing that about half of previously “safe” prompts can be jailbroken, revealing an information obfuscation effect when safety filters are active, demonstrating cross-model transferability of jailbreak prompts, and providing a dataset to support defense development. The work emphasizes the need for comprehensive, automated testing of safety systems and offers baseline defenses and insights to strengthen future safeguards against misuse in open- and closed-source T2I platforms.

Abstract

Text-to-image diffusion models, e.g. Stable Diffusion (SD), have recently shown remarkable ability in high-quality content generation and have become one of the representatives of the recent wave of transformative AI. Nevertheless, this advance comes with an intensifying concern about the misuse of this generative technology, especially for producing copyrighted or NSFW (i.e. not safe for work) images. Although efforts have been made to filter inappropriate images/prompts or remove undesirable concepts/styles via model fine-tuning, the reliability of these safety mechanisms against diversified problematic prompts remains largely unexplored. In this work, we propose Prompting4Debugging (P4D) as a debugging and red-teaming tool that automatically finds problematic prompts for diffusion models to test the reliability of a deployed safety mechanism. We demonstrate the efficacy of our P4D tool in uncovering new vulnerabilities of SD models with safety mechanisms. In particular, our results show that around half of the prompts in existing safe prompting benchmarks that were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms, including concept removal, negative prompt, and safety guidance. Our findings suggest that, without comprehensive testing, evaluations on limited safe prompting benchmarks can lead to a false sense of safety for text-to-image models.

Paper Structure

This paper contains 21 sections, 3 equations, 6 figures, 16 tables.

Figures (6)

  • Figure 1: Given an existing text-to-image (T2I) diffusion model ${\mathcal{G}}'$ with a safety mechanism that ideally removes the target concept (e.g. nudity) from the generated image (while the same input prompt would lead to inappropriate content for the typical T2I diffusion model ${\mathcal{G}}$), our proposed Prompting4Debugging (P4D) red-teams ${\mathcal{G}}'$ to automatically uncover the safety-evasive prompts.
  • Figure 2: An overview of our Prompting4Debugging (P4D) framework, which employs prompt engineering techniques to red-team the text-to-image (T2I) diffusion model ${\mathcal{G}}'$ with a safety mechanism (e.g. Stable Diffusion with negative prompts [rombach2022high], SLD [schramowski2023safe], and ESD [gandikota2023erasing]). With the help of an unconstrained T2I diffusion model $\mathcal{G}$, P4D optimizes to find the safety-evasive prompts (i.e. $P^\ast_{\text{cont}}$) that bypass the safety mechanism in ${\mathcal{G}}'$ and still lead to the generation of inappropriate image concepts/objects (e.g. nudity). This optimization procedure has three sequential steps; see the method section for details.
  • Figure 3: Visualization of images generated by different prompts (i.e. indicated by the sentence below the image) and T2I models (i.e. indicated by the model name on top of the image). Problematic prompts found by our P4D are colored in dark red. Notably, P4D demonstrates the capability to jailbreak safe T2I models and create images containing specific target concepts or objects that should have been restricted by safe T2I models.
  • Figure 4: Visualization of more images generated by different prompts and T2I models. The images are generated using the displayed prompts (i.e. the sentence below the image) with the specified T2I models (i.e. indicated by the model name on top of the image). Problematic prompts found by our P4D are colored in dark red.
  • Figure 5: Visualization of images generated from general problematic prompts found by different safe T2I models with P4D-$N$ and P4D-$K$.
  • ...and 1 more figure