LiteAttention: A Temporal Sparse Attention for Diffusion Transformers

Dor Shmilovich; Tony Wu; Aviad Dahan; Yuval Domb

TL;DR

The attention cost of Diffusion Transformers grows quadratically with sequence length, which slows video generation. LiteAttention exploits the temporal coherence of attention sparsity patterns to propagate skip decisions across denoising steps, combining the adaptivity of dynamic sparsity with the efficiency of static sparsity. It introduces an evolutionary skip framework, amortized sparsity profiling via a persistent Skip-Mask, and a GPU-optimized implementation on top of FlashAttention3, together with a calibration mechanism that controls accumulated error. Experiments on the Wan2.1 and Wan2.2 models show substantial speedups over baselines while preserving visual fidelity, pointing to a practical path to fast, high-quality diffusion-based video generation without retraining.

Abstract

Diffusion Transformers, particularly for video generation, achieve remarkable quality but suffer from quadratic attention complexity, leading to prohibitive latency. Existing acceleration methods face a fundamental trade-off: dynamically estimating sparse attention patterns at each denoising step incurs high computational overhead and estimation errors, while static sparsity patterns remain fixed and are often suboptimal throughout denoising. We identify a key structural property of diffusion attention: its sparsity patterns exhibit strong temporal coherence across denoising steps. Tiles deemed non-essential at step $t$ typically remain so at step $t+\delta$. Leveraging this observation, we introduce LiteAttention, a method that exploits temporal coherence to enable evolutionary computation skips across the denoising sequence. By marking non-essential tiles early and propagating skip decisions forward, LiteAttention eliminates redundant attention computations without repeated profiling overhead, combining the adaptivity of dynamic methods with the efficiency of static ones. We implement a highly optimized LiteAttention kernel on top of FlashAttention and demonstrate substantial speedups on production video diffusion models, with no degradation in quality. The code and implementation details will be publicly released.
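The idea of marking non-essential tiles early and propagating skip decisions forward can be illustrated with a small toy model. The sketch below is not the authors' kernel: the function name `propagate_skip_mask`, the per-tile importance scores, and the fixed `threshold` are all hypothetical, chosen only to show how a persistent mask lets later denoising steps inherit earlier skip decisions instead of re-profiling every tile.

```python
from typing import List


def propagate_skip_mask(
    tile_scores_per_step: List[List[List[float]]],
    threshold: float,
) -> List[int]:
    """Return the number of tiles computed at each denoising step.

    tile_scores_per_step[s][i][j] is a hypothetical importance score for the
    (query tile i, key tile j) pair at step s. A real kernel would derive
    such a score while computing the tile; here it is given as input.
    """
    n_q = len(tile_scores_per_step[0])
    n_k = len(tile_scores_per_step[0][0])
    # Persistent mask shared by all steps: True means "skip this tile".
    skip = [[False] * n_k for _ in range(n_q)]
    computed_per_step = []
    for scores in tile_scores_per_step:
        computed = 0
        for i in range(n_q):
            for j in range(n_k):
                if skip[i][j]:
                    continue  # decision inherited from an earlier step
                computed += 1  # the tile is computed at this step
                if scores[i][j] < threshold:
                    skip[i][j] = True  # mark for all later steps
        computed_per_step.append(computed)
    return computed_per_step


# Toy input (assumed values): a 2x2 tile grid over 3 denoising steps.
steps = [
    [[0.9, 0.01], [0.02, 0.8]],
    [[0.9, 0.5], [0.03, 0.04]],
    [[0.9, 0.5], [0.5, 0.5]],
]
counts = propagate_skip_mask(steps, threshold=0.05)
print(counts)  # [4, 2, 1]
```

In this toy run the number of computed tiles shrinks from 4 to 2 to 1 as the mask grows. Tile (0, 1) regains a high score at the second step but stays skipped, which is the kind of accumulated error that the calibration mechanism mentioned in the TL;DR is meant to control.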
