DiTFastAttn: Attention Compression for Diffusion Transformer Models
Zhihang Yuan, Hanling Zhang, Pu Lu, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, Yu Wang
TL;DR
Diffusion transformers suffer from quadratic self-attention costs at high resolutions. The authors introduce DiTFastAttn, a post-training compression framework that targets three redundancies in attention computation: spatial, temporal, and conditional. It addresses them with Window Attention with Residual Sharing, Attention Sharing across Timesteps, and Attention Sharing across CFG (classifier-free guidance), respectively, and uses a greedy plan optimizer to choose among them. For image generation, the approach reduces attention FLOPs by up to 76% and achieves up to 1.8x end-to-end speedup at 2k x 2k resolution while maintaining generation quality; it is also applied to video generation. This work offers a practical, training-free route to accelerating diffusion transformers and highlights the need for per-layer, per-step compression planning. The methods are complementary to quantization and distillation, broadening the toolbox for efficient diffusion-based generation.
Abstract
Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to the quadratic complexity of self-attention operators. We propose DiTFastAttn, a post-training compression method to alleviate the computational bottleneck of DiT. We identify three key redundancies in the attention computation during DiT inference: (1) spatial redundancy, where many attention heads focus on local information; (2) temporal redundancy, with high similarity between the attention outputs of neighboring steps; (3) conditional redundancy, where conditional and unconditional inferences exhibit significant similarity. We propose three techniques to reduce these redundancies: (1) Window Attention with Residual Sharing to reduce spatial redundancy; (2) Attention Sharing across Timesteps to exploit the similarity between steps; (3) Attention Sharing across CFG to skip redundant computations during conditional generation. We apply DiTFastAttn to DiT and PixArt-Sigma for image generation tasks, and to OpenSora for video generation tasks. Our results show that for image generation, our method reduces attention FLOPs by up to 76% and achieves up to 1.8x end-to-end speedup for high-resolution (2k x 2k) generation.
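To make the three techniques concrete, the following is a minimal NumPy sketch for a single attention head over a 1-D token sequence. It is an illustration under stated assumptions, not the authors' implementation: the names `attention`, `LayerCache`, `compressed_attention`, and `cfg_shared`, the mode strings, and the default `window` size are all hypothetical. In particular, the abstract does not spell out how residual sharing works; the sketch assumes that the difference between full and window attention is cached at a full-attention step and added back at later window-attention steps.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, window=None):
    """Single-head attention; if `window` is set, each token only
    attends to tokens within window // 2 positions of itself."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if window is not None:
        idx = np.arange(q.shape[0])
        local = np.abs(idx[:, None] - idx[None, :]) <= window // 2
        # Dense masking for clarity only; a real kernel would skip
        # the masked entries to actually save FLOPs.
        scores = np.where(local, scores, -np.inf)
    return softmax(scores) @ v

class LayerCache:
    """Per-layer state carried across denoising steps (hypothetical)."""
    def __init__(self):
        self.output = None    # last attention output
        self.residual = None  # full output minus window output

def compressed_attention(q, k, v, mode, cache, window=5):
    if mode == "full":
        out = attention(q, k, v)
        # Assumption: cache the long-range part that window attention misses.
        cache.residual = out - attention(q, k, v, window)
    elif mode == "wa_rs":
        # Window attention plus the residual cached at an earlier step.
        out = attention(q, k, v, window) + cache.residual
    elif mode == "ast":
        # Sharing across timesteps: reuse the previous step's output.
        out = cache.output
    else:
        raise ValueError(f"unknown mode: {mode}")
    cache.output = out
    return out

def cfg_shared(q_cond, k_cond, v_cond, mode, cache, window=5):
    """Sharing across CFG: compute attention for the conditional branch
    only and reuse the result for the unconditional branch."""
    out = compressed_attention(q_cond, k_cond, v_cond, mode, cache, window)
    return out, out
```

In this sketch a per-layer, per-step plan would amount to choosing a `mode` for each layer at each denoising step, starting from a `"full"` step so that the cache is populated; how that plan is selected is the job of the greedy optimizer mentioned in the TL;DR and is not modeled here.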
