Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing

Bingyan Liu, Chengyu Wang, Tingfeng Cao, Kui Jia, Jun Huang

TL;DR

An in-depth probing analysis shows that cross-attention maps in Stable Diffusion often contain object attribution information, which can cause editing failures. Based on this finding, the paper proposes a simplified, yet more stable and efficient, tuning-free procedure that modifies only the self-attention maps of specified attention layers during the denoising process.

Abstract

Deep Text-to-Image Synthesis (TIS) models such as Stable Diffusion have recently gained significant popularity for creative text-to-image generation. Yet, for domain-specific scenarios, tuning-free Text-guided Image Editing (TIE) is of greater importance for application developers; such methods modify objects or object properties in images by manipulating feature components in attention layers during the generation process. However, little is known about what semantic meanings these attention layers have learned and which parts of the attention maps contribute to the success of image editing. In this paper, we conduct an in-depth probing analysis and demonstrate that cross-attention maps in Stable Diffusion often contain object attribution information that can result in editing failures. In contrast, self-attention maps play a crucial role in preserving the geometric and shape details of the source image during the transformation to the target image. Our analysis offers valuable insights into understanding cross and self-attention maps in diffusion models. Moreover, based on our findings, we simplify popular image editing methods and propose a more straightforward yet more stable and efficient tuning-free procedure that only modifies self-attention maps of the specified attention layers during the denoising process. Experimental results show that our simplified method consistently surpasses the performance of popular approaches on multiple datasets.
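
The idea of modifying only the self-attention maps of specified layers can be illustrated with a toy sketch. This is not the authors' implementation: it uses single-head attention in NumPy on random features, and the names `self_attention`, `edit_step`, and `replace_layers` are hypothetical. It assumes a source branch and a target branch share the same weights, and that in the selected layers the target branch reuses the source branch's self-attention map while keeping its own values.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v, injected_map=None):
    """Single-head self-attention over x of shape (tokens, dim).

    If `injected_map` is given, it replaces the attention map computed
    from x; the values are still taken from x.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    if injected_map is not None:
        attn = injected_map  # reuse the source layout
    return attn @ v, attn


def edit_step(src_feats, tgt_feats, weights, replace_layers):
    """One toy pass through a stack of self-attention layers.

    In layers listed in `replace_layers` (an illustrative parameter), the
    target branch uses the source branch's self-attention map.
    """
    for layer, (w_q, w_k, w_v) in enumerate(weights):
        src_feats, src_map = self_attention(src_feats, w_q, w_k, w_v)
        inject = src_map if layer in replace_layers else None
        tgt_feats, _ = self_attention(tgt_feats, w_q, w_k, w_v, injected_map=inject)
    return src_feats, tgt_feats
```

In an actual diffusion pipeline the same replacement would be applied inside the U-Net's self-attention layers at each denoising step; which layers and steps to select is determined in the paper and is not specified in this summary.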

Paper Structure

This paper contains 26 sections, 4 equations, 12 figures, 9 tables, and 3 algorithms.

Figures (12)

  • Figure 1: An example showing that our method can perform more consistent and realistic TIE compared to P2P.
  • Figure 2: Cross and self-attention layers in Stable Diffusion.
  • Figure 3: Heatmaps of the cross-attention and self-attention maps for an image generated with the prompt "a white horse in the park". Each cross-attention heatmap corresponds to one word in the prompt. The self-attention heatmaps show the top-6 components obtained after SVD.
  • Figure 4: Results of cross-attention and self-attention map replacements in different layers of the diffusion model.
  • Figure 5: Editing results from replacing the attention maps of different tokens in a prompt. "-" is a minus sign: - "a" denotes subtracting the cross-attention map corresponding to "a".
  • ...and 7 more figures
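
The Figure 3 caption mentions visualizing the top-6 components of a self-attention map after SVD. One way to produce such components is sketched below. The helper `top_svd_components` is hypothetical, and using the right singular vectors with min-max normalization is an assumption; the caption does not state the authors' exact procedure.

```python
import numpy as np


def top_svd_components(self_attn, height, width, k=6):
    """Return the top-k SVD components of a self-attention map as images.

    `self_attn` has shape (N, N) with N = height * width. Each of the
    top-k right singular vectors (an assumed choice) is reshaped to
    (height, width) and min-max normalised to [0, 1] for display.
    """
    n = height * width
    if self_attn.shape != (n, n):
        raise ValueError("self_attn must have shape (height*width, height*width)")
    # np.linalg.svd returns singular values in descending order.
    _, _, vh = np.linalg.svd(self_attn)
    comps = vh[:k].reshape(k, height, width)
    lo = comps.min(axis=(1, 2), keepdims=True)
    hi = comps.max(axis=(1, 2), keepdims=True)
    return (comps - lo) / (hi - lo + 1e-8)
```

Each returned slice can be passed to an image viewer such as `matplotlib.pyplot.imshow` to obtain a heatmap.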