Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation

Hang Li, Chengzhi Shen, Philip Torr, Volker Tresp, Jindong Gu

TL;DR

This work proposes a novel self-supervised approach to find interpretable latent directions for a given concept in an interpretable latent space of diffusion models, where vectors can be read as semantic concepts, and discovers vectors related to inappropriate concepts.

Abstract

Diffusion-based models have gained significant popularity for text-to-image generation due to their exceptional image-generation capabilities. A risk with these models is the potential generation of inappropriate content, such as biased or harmful images. However, the underlying reasons for generating such undesired content from the perspective of the diffusion model's internal representation remain unclear. Previous work interprets vectors in an interpretable latent space of diffusion models as semantic concepts. However, existing approaches cannot discover directions for arbitrary concepts, such as those related to inappropriate concepts. In this work, we propose a novel self-supervised approach to find interpretable latent directions for a given concept. With the discovered vectors, we further propose a simple approach to mitigate inappropriate generation. Extensive experiments have been conducted to verify the effectiveness of our mitigation approach, namely, for fair generation, safe generation, and responsible text-enhancing generation. Project page: https://interpretdiffusion.github.io

Paper Structure

This paper contains 28 sections, 8 equations, 16 figures, 9 tables, 3 algorithms.

Figures (16)

  • Figure 1: Optimization framework to discover a semantic vector for a given concept. The top row shows that an image is first generated by the pretrained Stable Diffusion model for the prompt "a female face". The bottom part shows the optimization process for finding the vector for the concept "female" in the semantic $h$-space. The concept vector is used, together with the modified prompt "a face", to reconstruct the image through an iterative denoising process. Because the pretrained diffusion model is frozen, the gradients of the reconstruction loss update only the latent vector, which must therefore represent the missing gender information. After convergence, the latent vector is aligned with the U-Net's internal representation of the "female" concept and can be used to guide new image generation.
  • Figure 2: Fair Generation. Top: images generated from the prompt "doctor" are biased toward males. Bottom: we sample a learned male or female concept vector with equal probability when generating the doctors, so both genders are now represented in a balanced way. Images are generated from different random seeds.
  • Figure 3: Safe Generation. When the user's prompt contains implicit references to nudity, the original model (top row) generates an inappropriate image, which is blurred here. In contrast, our approach generates an image for the same prompt while applying a safety-related concept vector in $h$-space, identified in the previous section. The anti-sexual concept vector represents the direction that suppresses nudity, effectively eliminating inappropriate content while maintaining fidelity to the prompt.
  • Figure 4: Responsible text-enhancing generation. The original model may fail to capture the safety concepts specified in the text, such as "no violence". We propose extracting those safety concepts from the given prompt and activating the safety directions during generation. The bottom image demonstrates that incorporating our safety concepts can enhance the text guidance of the original prompt.
  • Figure 5: Gender fairness generation. From the prompt "a photo of a doctor", the original SD exhibits significant gender bias, as shown on the left side. Our approach with uniformly sampled gender vectors represents genders equally in the generated images.
  • ...and 11 more figures
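
The optimization described in the Figure 1 caption can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a linear stand-in for the frozen U-Net, random vectors in place of real text embeddings, and a reconstruction target taken from the frozen model's own prediction under the full prompt (a simplification of the reconstruction loss). All names (`predict_noise`, `learn_concept_vector`, `W_in`, `P`, `W_out`) are hypothetical. The point it shows is that, with the model weights fixed, gradient descent on the added bottleneck vector alone recovers the information that was dropped from the prompt.

```python
import numpy as np

rng = np.random.default_rng(0)
X_DIM, H_DIM, T_DIM = 16, 4, 8  # toy sizes, chosen arbitrarily

# Frozen "pretrained" weights: a linear stand-in for the U-Net (assumption).
W_in = rng.normal(size=(H_DIM, X_DIM))   # noisy image -> bottleneck (h-space)
P = rng.normal(size=(H_DIM, T_DIM))      # text embedding -> bottleneck
W_out = rng.normal(size=(X_DIM, H_DIM))  # bottleneck -> predicted noise


def predict_noise(x_t, text_emb, c=None):
    """Toy denoiser; optionally adds concept vector c to the bottleneck."""
    h = x_t @ W_in.T + text_emb @ P.T
    if c is not None:
        h = h + c
    return h @ W_out.T


def learn_concept_vector(x_t, target, text_without_concept, steps=2000):
    """Gradient descent on c only; W_in, P and W_out are never updated."""
    c = np.zeros(H_DIM)
    # Step size derived from the largest singular value of W_out,
    # which keeps plain gradient descent stable on this quadratic loss.
    lr = 1.0 / (2.0 * np.linalg.norm(W_out, 2) ** 2)
    for _ in range(steps):
        residual = predict_noise(x_t, text_without_concept, c) - target
        # Gradient of the batch-mean squared error with respect to c.
        grad = 2.0 * residual.mean(axis=0) @ W_out
        c -= lr * grad
    return c


# Stand-ins for the embedding of "a face" and for what "female" adds to it.
text_base = rng.normal(size=T_DIM)
text_concept = rng.normal(size=T_DIM)
x_t = rng.normal(size=(32, X_DIM))  # batch of toy noisy images

# Targets come from the frozen model conditioned on the full prompt.
target = predict_noise(x_t, text_base + text_concept)

c = learn_concept_vector(x_t, target, text_base)

# In this linear toy, the missing information is exactly P @ text_concept.
recovery_error = np.abs(c - P @ text_concept).max()
print("max abs recovery error:", recovery_error)
```

In this toy setting the learned vector can be checked against a known answer, which is not possible with a real diffusion model. There, the vector would instead be evaluated by adding it during generation and inspecting the resulting images, as in the fair and safe generation figures.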