DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

Lianghui Zhu, Zilong Huang, Bencheng Liao, Jun Hao Liew, Hanshu Yan, Jiashi Feng, Xinggang Wang

TL;DR

The paper addresses the inefficiency of quadratic attention in diffusion-model backbones by introducing Diffusion Gated Linear Attention (DiG), a sub-quadratic backbone built on Gated Linear Attention (GLA). It adds a lightweight Spatial Reorient & Enhancement Module (SREM) and a DiG block to enable efficient block-wise scanning and local context awareness, with two variants: plain DiG and U-DiG. Empirical results show DiG achieves competitive image quality on ImageNet 256x256 while significantly reducing training time and GPU memory, and scales favorably to higher resolutions (512–2048) compared to DiT, Mamba-based, and FlashAttention-2 baselines. The work positions DiG as a scalable, efficient backbone for long-sequence diffusion tasks and suggests potential extensions to broader modalities like video and audio.

Abstract

Diffusion models with large-scale pre-training have achieved significant success in the field of visual content generation, particularly exemplified by Diffusion Transformers (DiT). However, DiT models are held back by the quadratic complexity of attention, especially when handling long sequences. In this paper, we aim to incorporate the sub-quadratic modeling capability of Gated Linear Attention (GLA) into the 2D diffusion backbone. Specifically, we introduce Diffusion Gated Linear Attention Transformers (DiG), a simple, adoptable solution with minimal parameter overhead. We offer two variants, i.e., a plain and a U-shape architecture, showing superior efficiency and competitive effectiveness. In addition to outperforming DiT and other sub-quadratic-time diffusion models at $256 \times 256$ resolution, DiG is more efficient than these methods starting from a $512$ resolution. Specifically, DiG-S/2 is $2.5\times$ faster and saves $75.7\%$ GPU memory compared to DiT-S/2 at a $1792$ resolution. Additionally, DiG-XL/2 is $4.2\times$ faster than the Mamba-based model at a $1024$ resolution and $1.8\times$ faster than DiT with FlashAttention-2 at a $2048$ resolution. Code is released at https://github.com/hustvl/DiG.
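To make the sub-quadratic claim concrete, here is a minimal sketch of the recurrent form of gated linear attention for a single head, following the general GLA recurrence $S_t = \mathrm{diag}(\alpha_t)\, S_{t-1} + k_t^\top v_t$, $o_t = q_t S_t$. It is not the DiG implementation. The function name `gla_recurrent` is illustrative, and the gates `alpha` are passed in directly as an assumption, whereas GLA computes them from the input. Pure Python is used so the sketch has no dependencies.

```python
def gla_recurrent(q, k, v, alpha):
    """Recurrent gated linear attention for one head (illustrative sketch).

    q, k, alpha: T rows of d_k floats; v: T rows of d_v floats.
    alpha[t][i] in (0, 1] is the forget gate for key channel i at step t
    (assumed to be given; a real GLA layer derives it from the input).
    Returns T rows of d_v floats.
    """
    d_k, d_v = len(k[0]), len(v[0])
    # Fixed-size state: its size does not depend on the sequence length T.
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q_t, k_t, v_t, a_t in zip(q, k, v, alpha):
        # Decay the old state per key channel, then add the outer product k_t^T v_t.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = a_t[i] * state[i][j] + k_t[i] * v_t[j]
        # Read out: o_t = q_t S_t.
        outputs.append(
            [sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


# Tiny example with T = 2, d_k = d_v = 1 and a constant gate of 0.5.
out = gla_recurrent(q=[[1.0], [1.0]], k=[[1.0], [1.0]],
                    v=[[2.0], [3.0]], alpha=[[0.5], [0.5]])
print(out)
```

Each step touches only the $d_k \times d_v$ state, so the total cost grows linearly with the number of tokens $T$, in contrast to the $T \times T$ score matrix of softmax attention. Setting every gate to 1 recovers ungated linear attention.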

Paper Structure

This paper contains 25 sections, 15 equations, 18 figures, 6 tables, 3 algorithms.

Figures (18)

  • Figure 1: Efficiency comparison among DiT [peebles2023dit] with Attention [vaswani2017attention], DiS [fei2024dis] with Mamba [gu2023mamba], and our DiG model. DiG achieves higher training speed while using less GPU memory on high-resolution images. For example, DiG is $2.5\times$ faster than DiT and saves $75.7\%$ GPU memory at a resolution of $1792 \times 1792$, i.e., 12544 tokens per image. Patch size for all models is 2.
  • Figure 2: FPS comparison among DiS [fei2024dis] with Mamba [gu2023mamba], DiT [peebles2023dit] with Attention [vaswani2017attention], DiT with FlashAttention-2 (Flash-DiT) [dao2023flashattention2], and our DiG model across different model sizes. We take DiG as the baseline. At a resolution of $1024 \times 1024$, DiG is $2.0\times$ faster than DiS at the small size and $4.2\times$ faster at the XL size. Furthermore, DiG-XL/2 is $1.8\times$ faster than the highly optimized Flash-DiT-XL/2 at a resolution of $2048 \times 2048$.
  • Figure 3: Pipeline of GLA.
  • Figure 4: The overview of the proposed DiG models. The figure presents the (a) plain DiG, denoted as DiG, (b) U-shape DiG, denoted as U-DiG, (c) DiG block, and (d) block-by-block scanning directions of DiG controlled by the SREM.
  • Figure 5: Details of the Spatial Reorient & Enhancement Module (SREM).
  • ...and 13 more figures
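
The token count quoted in the Figure 1 caption can be checked with a little arithmetic. The sketch below assumes the image is first encoded by a VAE with 8× spatial downsampling before patchification, as is common for latent diffusion backbones such as DiT. That factor is an assumption not stated in the captions, but it reproduces the quoted 12544 tokens at $1792 \times 1792$ with patch size 2. The function name `num_tokens` is hypothetical.

```python
def num_tokens(resolution, patch_size=2, vae_downsample=8):
    """Tokens per square image after VAE encoding and patchification.

    vae_downsample=8 is an assumption (not given in the captions);
    patch_size=2 matches the "/2" models in the figures.
    """
    latent_side = resolution // vae_downsample   # side length of the latent map
    tokens_per_side = latent_side // patch_size  # patches along one side
    return tokens_per_side * tokens_per_side


for res in (256, 512, 1024, 1792, 2048):
    n = num_tokens(res)
    # Softmax attention scores scale with n^2; a linear-attention scan scales with n.
    print(f"{res}x{res}: {n} tokens, {n * n} attention score entries")
```

Under this assumption, going from $256 \times 256$ to $1792 \times 1792$ multiplies the token count by 49 but the attention score matrix by $49^2 = 2401$, which is why the efficiency gap between quadratic and sub-quadratic backbones widens with resolution.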