
DiTFastAttn: Attention Compression for Diffusion Transformer Models

Zhihang Yuan, Hanling Zhang, Pu Lu, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, Yu Wang

TL;DR

Self-attention in diffusion transformers scales quadratically with the number of tokens, which becomes costly at high resolutions. The authors introduce DiTFastAttn, a post-training compression framework that identifies spatial, temporal, and conditional redundancies and addresses them with three techniques: Window Attention with Residual Sharing, Attention Sharing across Timesteps, and Attention Sharing across CFG. A greedy plan optimizer guides how the techniques are applied. The approach reduces attention FLOPs by up to 76% and yields up to 1.8x end-to-end speedup on high-resolution image generation, with gains for video generation as well, while maintaining generation quality. This work offers a practical, training-free route to accelerate diffusion transformers and highlights the need for per-layer, per-step compression planning. The methods are complementary to quantization and distillation, broadening the toolbox for efficient diffusion-based generation.
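
To make the quadratic cost concrete, here is a rough back-of-the-envelope estimate. The symbols are assumptions introduced for illustration and do not come from the paper: $N$ is the number of tokens, $d$ the per-head dimension, $w$ the number of keys each token attends to under window attention, and $p$ the patch size for an $H \times W$ latent. Counting one multiply-add as two FLOPs and ignoring the softmax, the $QK^\top$ and attention-times-$V$ products give, per head:

$$
\text{FLOPs}_{\text{full}} \approx 4N^2 d, \qquad
\text{FLOPs}_{\text{window}} \approx 4Nwd, \qquad
\frac{\text{FLOPs}_{\text{window}}}{\text{FLOPs}_{\text{full}}} \approx \frac{w}{N}, \qquad
N = \frac{HW}{p^2}.
$$

Under these assumptions, doubling both $H$ and $W$ multiplies $N$ by 4 and the full-attention cost by 16, whereas the cost of window attention with fixed $w$ grows only by a factor of 4.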

Abstract

Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to the quadratic complexity of self-attention operators. We propose DiTFastAttn, a post-training compression method to alleviate the computational bottleneck of DiT. We identify three key redundancies in the attention computation during DiT inference: (1) spatial redundancy, where many attention heads focus on local information; (2) temporal redundancy, with high similarity between the attention outputs of neighboring steps; (3) conditional redundancy, where conditional and unconditional inferences exhibit significant similarity. We propose three techniques to reduce these redundancies: (1) Window Attention with Residual Sharing to reduce spatial redundancy; (2) Attention Sharing across Timesteps to exploit the similarity between steps; (3) Attention Sharing across CFG to skip redundant computations during conditional generation. We apply DiTFastAttn to DiT, PixArt-Sigma for image generation tasks, and OpenSora for video generation tasks. Our results show that for image generation, our method reduces up to 76% of the attention FLOPs and achieves up to 1.8x end-to-end speedup at high-resolution (2k x 2k) generation.
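
The two sharing techniques named in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `SharedAttention`, its flags `share_step` and `share_cfg`, and the convention that index 0 holds the conditional branch are all hypothetical. In the paper, whether a given layer and step shares is decided by a compression plan; here the caller passes the flags by hand.

```python
import numpy as np

def full_attention(q, k, v):
    """Standard softmax attention for one head; q, k, v have shape (N, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

class SharedAttention:
    """Hypothetical per-layer wrapper that caches the last attention output."""

    def __init__(self):
        self.cached = None
        self.calls = 0  # how many times attention was actually computed

    def __call__(self, q, k, v, share_step=False, share_cfg=False):
        # q, k, v: shape (2, N, d); index 0 = conditional, 1 = unconditional
        # (this batch layout is an assumption made for the sketch).
        if share_step and self.cached is not None:
            # Sharing across timesteps: reuse the previous step's output.
            return self.cached
        if share_cfg:
            # Sharing across CFG: compute the conditional branch only and
            # reuse its output for the unconditional branch.
            cond = full_attention(q[0], k[0], v[0])
            self.calls += 1
            out = np.stack([cond, cond])
        else:
            out = np.stack([full_attention(q[i], k[i], v[i]) for i in range(2)])
            self.calls += 2
        self.cached = out
        return out
```

A layer that shares across a timestep performs no attention computation at all for that step, and a layer that shares across CFG performs half of it, which is where the savings in this sketch come from.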

Paper Structure

This paper contains 29 sections, 3 equations, 14 figures, 6 tables, 1 algorithm.

Figures (14)

  • Figure 1: Left: The efficiency benefits of applying DiTFastAttn to PixArt-Sigma (Chen et al., 2024) when generating images at different resolutions. The Y-axis shows the fraction of FLOPs, normalized by the FLOPs of the original model. Right: Qualitative results of applying DiTFastAttn to 1024$\times$1024 PixArt-Sigma.
  • Figure 2: Types of redundancy and corresponding compression techniques. Left: Redundancy in the spatial dimension, denoising steps, and CFG. Right: Techniques implemented in DiTFastAttn to reduce redundancy for each type. DiTFastAttn employs window attention to minimize attention redundancy, while maintaining performance using residuals. Additionally, attention outputs are shared both step-wise and CFG-wise to reduce redundancy.
  • Figure 3: Window Attention with Residual Sharing. (a) Left: Example of an attention map showing the window pattern. Right: The MSE between the window attention outputs in the previous and current step (yellow line) versus the MSE between the output residuals of window and full attention in the previous and current step (blue line). The output residual exhibits minimal changes over the steps. (b) Computation of Window Attention with Residual Sharing. Window attention, whose output changes significantly between steps, is recalculated. Residuals, which change minimally, are cached and reused in subsequent steps.
  • Figure 4: Similarity of Attention Outputs Across Step and CFG Dimensions in DiT. (a) Similarity of attention outputs across the step dimension in different layers. (b) Similarity between conditional and unconditional attention outputs in various layers at different steps.
  • Figure 5: Compression plan for DiT-XL-512, PixArt-Sigma-XL-1024 and PixArt-Sigma-XL-2K at D6 with the number of DPM-Solver steps set to 50.
  • ...and 9 more figures
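
The computation described in the Figure 3 caption can be sketched as follows. This is a minimal NumPy illustration under stated assumptions rather than the paper's implementation: tokens are treated as a 1-D sequence, the window is a symmetric band of half-width `window`, and the names `attention`, `WindowAttnResidualSharing`, `full_step`, and `window_step` are hypothetical. The dense mask used here does not save any FLOPs; a real kernel would skip the masked entries.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, window=None):
    """Single-head attention on (N, d) inputs.

    If window is given, token i only attends to tokens j with |i - j| <= window.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    if window is not None:
        idx = np.arange(n)
        outside = np.abs(idx[:, None] - idx[None, :]) > window
        scores = np.where(outside, -np.inf, scores)  # mask keys outside the band
    return softmax(scores) @ v

class WindowAttnResidualSharing:
    """Caches (full - window) at one step and reuses it at later steps."""

    def __init__(self, window):
        self.window = window
        self.residual = None

    def full_step(self, q, k, v):
        # Compute full attention and store the residual w.r.t. window attention.
        full = attention(q, k, v)
        self.residual = full - attention(q, k, v, self.window)
        return full

    def window_step(self, q, k, v):
        # Recompute only window attention; add the cached residual.
        return attention(q, k, v, self.window) + self.residual
```

When the inputs are unchanged, `window_step` reproduces the full-attention output exactly; when they drift slightly between steps, the cached residual is meant to recover most of what window attention alone would lose.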