Attention in Diffusion Model: A Survey

Litao Hua, Fan Liu, Jie Su, Xingyu Miao, Zizhou Ouyang, Zeyu Wang, Runze Hu, Zhenyu Wen, Bing Zhai, Yang Long, Haoran Duan, Yuan Zhou

TL;DR

Attention mechanisms are foundational in diffusion models and influence both generative and discriminative tasks. This survey delivers a unified taxonomy of attention modifications that operate on different components of diffusion architectures, and maps these techniques to a broad set of unimodal and multimodal tasks. It reviews architectural innovations, performance benefits, and practical applications, and identifies limitations and underexplored directions for future work. By clarifying how attention interfaces with diffusion dynamics, the paper provides a roadmap for designing more controllable, efficient, and interpretable diffusion-based systems.

Abstract

Attention mechanisms have become a foundational component in diffusion models, significantly influencing their capacity across a wide range of generative and discriminative tasks. This paper presents a comprehensive survey of attention within diffusion models, systematically analysing its roles, design patterns, and operations across different modalities and tasks. We propose a unified taxonomy that categorises attention-related modifications into parts according to the structural components they affect, offering a clear lens through which to understand their functional diversity. In addition to reviewing architectural innovations, we examine how attention mechanisms contribute to performance improvements in diverse applications. We also identify current limitations and underexplored areas, and outline potential directions for future research. Our study provides valuable insights into the evolving landscape of diffusion models, with a particular focus on the integrative and ubiquitous role of attention.

Paper Structure

This paper contains 47 sections, 6 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: A typical pipeline of diffusion models, highlighting the attention mechanism for clarity. The pipeline consists of two stages: diffusion and denoising. Initially, the original image $x$ is encoded and gradually noised into $z_T$. Then, starting from $z_T$, the denoising U-Net, utilizing both cross-attention and self-attention, removes noise and reconstructs the image $x'$. Notably, the attention blocks within U-Net are presented in detail, illustrating how cross-attention and self-attention are implemented and interact. This detailed representation is crucial for understanding the model's internal workings, especially regarding the attention mechanisms.
  • Figure 2: An illustration of the method used to identify the components of attention in diffusion models. $W_q$, $W_k$ and $W_v$ represent the weight matrices for the query, key and value, respectively. $x$ stands for the input and $d$ is the scaling factor. We categorize the attention modifications into 5 levels based on the changes made to different components of attention. In each level, the modified parts are highlighted in black, while the unmodified parts are shown in gray.
  • Figure 3: The timeline of the development of attention-related methods and diffusion models. The boxes indicate representative works. The boxes marked with a smile symbol represent the foundation models in this field.
  • Figure 4: Taxonomy of attention methods in diffusion models.
  • Figure 5: An illustration of a typical architecture of consistency enhancement. The left side of the figure illustrates the consistency issue, while the right side shows the method of modifying attention to maintain consistency. $Q_s$, $K_s$ and $V_s$ originate from the source image or text. $Q_t$, $K_t$ and $V_t$ come from the target image or text. The modulated components of attention are highlighted with red boxes.
  • ...and 3 more figures
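
The captions for Figures 1 and 2 refer to the query, key and value projections $W_q$, $W_k$, $W_v$ and to self-attention and cross-attention inside the denoising U-Net. The following is a minimal sketch of those two operations, not the survey's implementation. It assumes the standard scaled dot-product form, in which scores are divided by $\sqrt{d}$ with $d$ taken to be the key dimension (the caption only calls $d$ the scaling factor). The function names, the toy token values and the identity weight matrices are hypothetical, and plain Python lists are used only to keep the example self-contained.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project(tokens, weight):
    """Multiply each token (row vector) by a d_in x d_out weight matrix."""
    cols = list(zip(*weight))
    return [[dot(t, col) for col in cols] for t in tokens]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x_query, x_context, w_q, w_k, w_v):
    """Scaled dot-product attention.

    Queries come from x_query; keys and values come from x_context.
    Returns the attended output and the attention weights.
    """
    q = project(x_query, w_q)
    k = project(x_context, w_k)
    v = project(x_context, w_v)
    d = len(k[0])  # assumption: d is the key dimension
    weights = [softmax([dot(qi, kj) / math.sqrt(d) for kj in k]) for qi in q]
    v_cols = list(zip(*v))
    out = [[dot(w_row, col) for col in v_cols] for w_row in weights]
    return out, weights

# Hypothetical toy data: 3 image tokens and 2 text-condition tokens, width 2.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[0.5, -0.5], [1.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned W_q, W_k, W_v

# Self-attention: queries, keys and values all come from the image tokens.
self_out, self_weights = attention(image_tokens, image_tokens,
                                   identity, identity, identity)

# Cross-attention: queries from the image tokens, keys/values from the text.
cross_out, cross_weights = attention(image_tokens, text_tokens,
                                     identity, identity, identity)
```

In this sketch the only difference between the two calls is where the keys and values come from: the same sequence for self-attention, and the conditioning sequence for cross-attention. Each row of the weight matrix sums to one, so every output token is a convex combination of the value vectors.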