Do not think about pink elephant!

Kyomin Hwang, Suyoung Kim, JunHoo Lee, Nojun Kwak

TL;DR

This paper shows that recent large models such as Stable Diffusion and DALL-E3 also share the vulnerability of human intelligence, namely the white bear phenomenon, and proposes a simple prompt-based attack method, which generates figures prohibited by the LM provider's policy.

Abstract

Large Models (LMs) have heightened expectations for the potential of general AI as they are akin to human intelligence. This paper shows that recent large models such as Stable Diffusion and DALL-E3 also share the vulnerability of human intelligence, namely the "white bear phenomenon". We investigate the causes of the white bear phenomenon by analyzing their representation space. Based on this analysis, we propose a simple prompt-based attack method, which generates figures prohibited by the LM provider's policy. To counter these attacks, we introduce prompt-based defense strategies inspired by cognitive therapy techniques, successfully mitigating attacks by up to 48.22%.

Paper Structure (18 sections, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Example images that illustrate the "white bear phenomenon." The image generation model does not understand the negative word. Images are generated by using DALL-E3.
  • Figure 2: Histogram of the cosine similarity between two distinct CLS tokens obtained through the CLIP text encoder, with measurements presented in units of 1e-3. '$w^{hyp}$' is a hypernym of '$w$' and '$w^{syn}$' is a synonym of '$w$'.
  • Figure 3: Visualization by t-SNE of CLIP embeddings for various sentences.
  • Figure 4: The images provided illustrate the application of our proposed attack and defense strategies to the Stable Diffusion model. The image on the left shows the outcome of an attack using the prompt "draw $w_{abs}$ without $w_{con}$", while those on the right show the results of the defense strategies with the prompts "draw $w_{abs}$, which is $w_{abs}^{def}$, without $w_{con}$" and "draw $w_{abs}$, include $w_{con}^{1}$, instead of $w_{con}^{2}$", respectively.
  • Figure 5: The results of applying our proposed attack method to DALL-E3, which uses LLM-augmented prompts. Although the LLM rephrases the user's input prompt, the model still generates a white bear.
  • ...and 2 more figures