LiteAttention: A Temporal Sparse Attention for Diffusion Transformers

Dor Shmilovich, Tony Wu, Aviad Dahan, Yuval Domb

TL;DR

Diffusion Transformers suffer from quadratic attention cost, which slows video generation. LiteAttention exploits the temporal coherence of attention sparsity to propagate skip decisions across denoising steps, combining the adaptivity of dynamic sparsity patterns with the efficiency of static ones. It introduces an evolutionary skip framework, amortized sparsity profiling via a persistent Skip-Mask, and a GPU-optimized implementation atop FlashAttention3, with a calibration mechanism to control accumulated error. Empirical results on Wan2.1/2.2 models show substantial speedups over baselines while preserving visual fidelity, demonstrating a practical path to fast, high-quality diffusion-based video generation without retraining.

Abstract

Diffusion Transformers, particularly for video generation, achieve remarkable quality but suffer from quadratic attention complexity, leading to prohibitive latency. Existing acceleration methods face a fundamental trade-off: dynamically estimating sparse attention patterns at each denoising step incurs high computational overhead and estimation errors, while static sparsity patterns remain fixed and often suboptimal throughout denoising. We identify a key structural property of diffusion attention, namely, its sparsity patterns exhibit strong temporal coherence across denoising steps. Tiles deemed non-essential at step $t$ typically remain so at step $t+\delta$. Leveraging this observation, we introduce LiteAttention, a method that exploits temporal coherence to enable evolutionary computation skips across the denoising sequence. By marking non-essential tiles early and propagating skip decisions forward, LiteAttention eliminates redundant attention computations without repeated profiling overheads, combining the adaptivity of dynamic methods with the efficiency of static ones. We implement a highly optimized LiteAttention kernel on top of FlashAttention and demonstrate substantial speedups on production video diffusion models, with no degradation in quality. The code and implementation details will be publicly released.
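The idea of marking tiles once and carrying the decision forward can be illustrated with a small NumPy sketch. This is not the authors' kernel: the function name `attention_with_skip_mask`, the tile size, and the rule "a tile is non-essential when every score in it lies at least `threshold` below its row maximum" are all assumptions made for illustration, and the sequence length is assumed to be a multiple of the tile size.

```python
import numpy as np

def attention_with_skip_mask(q, k, v, skip_mask, tile=4, threshold=10.0):
    """Tiled softmax attention that honors and extends a persistent skip mask.

    skip_mask[i, j] == True means tile (query block i, key block j) is skipped.
    The threshold rule used to mark tiles is an illustrative assumption.
    """
    n, d = q.shape
    n_tiles = n // tile  # assumes n is a multiple of tile
    scores = np.full((n, n), -np.inf)
    active = [(i, j) for i in range(n_tiles) for j in range(n_tiles)
              if not skip_mask[i, j]]

    # Compute QK^T only for tiles that earlier steps did not mark as skippable.
    for i, j in active:
        rows = slice(i * tile, (i + 1) * tile)
        cols = slice(j * tile, (j + 1) * tile)
        scores[rows, cols] = q[rows] @ k[cols].T / np.sqrt(d)

    row_max = scores.max(axis=1, keepdims=True)
    weights = np.exp(scores - row_max)  # skipped tiles contribute exp(-inf) = 0
    out = (weights / weights.sum(axis=1, keepdims=True)) @ v

    # Mark newly negligible tiles; the mask only grows, so decisions propagate.
    new_mask = skip_mask.copy()
    for i, j in active:
        rows = slice(i * tile, (i + 1) * tile)
        cols = slice(j * tile, (j + 1) * tile)
        if np.all(scores[rows, cols] - row_max[rows] <= -threshold):
            new_mask[i, j] = True
    return out, new_mask

# Toy denoising loop: two token groups that attend almost only to themselves.
rng = np.random.default_rng(0)
group = np.repeat(np.eye(2), 4, axis=0)  # 8 tokens, feature dimension 2
v = rng.normal(size=(8, 2))
mask = np.zeros((2, 2), dtype=bool)      # nothing is skipped at the first step
computed_tiles = []
for step in range(3):
    q = k = 6.0 * group + 0.01 * step    # inputs drift slightly between steps
    computed_tiles.append(int((~mask).sum()))
    out, mask = attention_with_skip_mask(q, k, v, mask)
```

In this toy setup the first step computes all four tiles and marks the two off-diagonal ones, so later steps compute only two tiles without re-profiling. The tile holding a row's maximum can never satisfy the threshold rule, so every query row keeps at least one active tile.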

Paper Structure

This paper contains 23 sections, 14 equations, 4 figures, 3 tables, 2 algorithms.

Figures (4)

  • Figure 1: A graphical depiction of the Skip-Mask update step in Algorithm \ref{alg:qk_skip}.
  • Figure 2: Toy run of FlashAttention vs. $\tt LiteAttention$ within a video diffusion model for a varying number of video frames. Left: runtimes. Right: $\tt LiteAttention$'s sparsity. Assuming FlashAttention has quadratic complexity, the increasing sparsity suggests that $\tt LiteAttention$ has lower complexity; if it were also quadratic, the sparsity percentage would stay constant rather than increase.
  • Figure 3: $\tt LiteAttention$'s pipeline for the two-warpgroup H100 configuration (based on FA3). In SkipLogic-1, a skip bit is computed for each warp in the warpgroup. In SkipLogic-2, a skip bit is again computed per warp and the result is combined with the bitmap of warpgroup 1. In SkipLogic-3, the warp-level skip bitmap is reduced to a single skip bit for the complete tile.
  • Figure 4: The evolving Skip-Mask across diffusion timesteps for LTX-13B \cite{HaCohen2024LTXVideo} over two block/head sets. The top and bottom are the start and end masks, respectively. Dark purple means skipped.
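
The three SkipLogic stages named in the Figure 3 caption can be read as a bitmap reduction. The sketch below emulates that reading in plain Python rather than CUDA. The helper names `warp_bits` and `tile_skip_bit` are hypothetical, and two points are assumptions not stated in the caption: that the second warpgroup's bits are merged by placing them next to the first warpgroup's bits, and that a tile is skipped only when every warp votes to skip.

```python
WARPS_PER_WARPGROUP = 4  # an H100 warpgroup consists of four warps (128 threads)

def warp_bits(warp_votes):
    """Pack one boolean skip vote per warp into a small integer bitmap."""
    bitmap = 0
    for warp_index, vote in enumerate(warp_votes):
        if vote:
            bitmap |= 1 << warp_index
    return bitmap

def tile_skip_bit(votes_wg1, votes_wg2):
    """Emulate the three stages under the assumptions stated above."""
    # SkipLogic-1: one skip bit per warp of warpgroup 1.
    bitmap = warp_bits(votes_wg1)
    # SkipLogic-2: one skip bit per warp of warpgroup 2, combined with the
    # bitmap of warpgroup 1 (assumed: stored in the next four bit positions).
    bitmap |= warp_bits(votes_wg2) << WARPS_PER_WARPGROUP
    # SkipLogic-3: reduce the warp-level bitmap to a single bit for the tile
    # (assumed: skip only if all eight warps agree).
    all_warps = (1 << (2 * WARPS_PER_WARPGROUP)) - 1
    return bitmap == all_warps
```

Requiring unanimity is the conservative choice: a single warp that still sees a relevant score keeps the whole tile in the computation.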