Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation

Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, Song Han

TL;DR

The paper tackles the heavy computational burden of 3D attention in diffusion-based video generation by identifying Spatiotemporal Energy Decay and introducing Radial Attention, a static $O(n \log n)$ sparse pattern that preserves essential spatiotemporal interactions. By mapping energy decay to compute density, it constructs a 4D attention mask with dense central regions and progressively sparser outer bands, achieving substantial speedups while maintaining video fidelity across multiple backbones. It further enables efficient long-video generation via LoRA-based fine-tuning and demonstrates up to 4× longer video generation with reduced training costs and faster inference. The approach outperforms strong sparse baselines (SVG, STA, PA) in quality metrics and provides a practical pathway to scalable, high-fidelity long-video diffusion. The work also includes theoretical error bounds and ablations validating design choices, with open-source code to facilitate adoption.

Abstract

Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as the spatial and temporal distance between tokens increases, akin to the physical decay of signals or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with $\mathcal{O}(n \log n)$ complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard $\mathcal{O}(n^2)$ dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9$\times$ speedup over the original dense attention. With minimal tuning, it enables video generation up to 4$\times$ longer while reducing training costs by up to 4.4$\times$ compared to direct fine-tuning and accelerating inference by up to 3.7$\times$ compared to dense attention inference. Code is released at https://github.com/mit-han-lab/radial-attention.
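
As a back-of-envelope illustration of why a window that halves while the temporal band doubles leads to $\mathcal{O}(n \log n)$ cost, the following sketch counts computed query-key pairs. It rests on assumptions that are not stated in the abstract: $f$ frames of $s$ tokens each ($n = fs$), outer band $b \ge 1$ holding frame offsets $d$ with $2^b \le |d| < 2^{b+1}$, and a spatial window of width $w_b = s / 2^b$ in that band. It is not the paper's own error or complexity analysis.

```latex
% Assumed setup (hypothetical notation): f frames, s tokens per frame, n = f s,
% band b >= 1 covers frame offsets 2^b <= |d| < 2^{b+1}, window width w_b = s / 2^b >= 1.
\begin{align*}
\text{pairs in one frame-to-frame block of band } b
  &\le s\,(2 w_b - 1) < \frac{2 s^2}{2^b}, \\
\text{frame offsets in bands } {+b} \text{ and } {-b}
  &\le 2 \cdot 2^b, \\
\text{pairs per query frame in bands } \pm b
  &< 2 \cdot 2^b \cdot \frac{2 s^2}{2^b} = 4 s^2, \\
\text{pairs per query frame in the central band } (|d| \le 1)
  &\le 3 s^2, \\
\text{total over } f \text{ query frames and at most } \log_2 f \text{ outer bands}
  &< 4 s^2 f \log_2 f + 3 s^2 f
   = 4 s\, n \log_2 \frac{n}{s} + 3 s\, n .
\end{align*}
```

Under the same assumptions, once $w_b$ falls below 1 only the diagonal of roughly every $\lceil 1 / w_b \rceil$-th frame is kept, which costs at most $2 s^2$ per band pair, so the per-band bound of $4 s^2$ still holds. For a fixed spatial resolution $s$, the total therefore grows as $n \log n$ in the number of tokens, compared with $n^2$ for dense attention.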

Paper Structure

This paper contains 30 sections, 22 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: We present Radial Attention, a sparse attention mechanism with $\mathcal{O}(n \log n)$ computational complexity. Radial Attention accelerates pre-trained HunyuanVideo (Kong et al., 2024) by 1.9× at its default video length while maintaining comparable video quality. When generating 4× longer videos, it reduces tuning costs by up to 4.4× and speeds up inference by up to 3.7× versus dense attention.
  • Figure 2: Radial Attention reduces the computational complexity of attention from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$. When generating a 509-frame 720p video with HunyuanVideo, it reduces the attention computation by 9×, achieves 3.7× speedup, and saves 4.4× tuning costs.
  • Figure 3: Attention pipelines of SVG (Xi et al., 2025) and our Radial Attention. Softmax is omitted for clarity. (a) SVG dynamically selects either a spatial or temporal attention for each head to speed up inference. However, it does not overcome the original model's length limitation and cannot be trained on unseen distributions like longer videos. (b) Our Radial Attention uses a static mask that unifies spatial and temporal attention with $\mathcal{O}(n \log n)$ computational complexity. This static design enables efficient longer-video adaptation.
  • Figure 4: (a) Example spatial and temporal attention maps from HunyuanVideo (defined in the section "Spatiotemporal Energy Decay in Attention"). (b) Attention score distributions. (b1): Average attention score between tokens at the same spatial location decreases with temporal distance. (b2): Average attention score within a frame decreases with spatial distance. Spatial and Temporal Attention refer to the distributions derived from the corresponding maps in (a). Average means averaging over multiple random maps and diffusion steps. The plots indicate that spatial attention shows a high temporal decay and relatively low spatial decay, while temporal attention exhibits the opposite.
  • Figure 5: (a) The compute density pattern. The attention map is divided into $2\lceil\log_2(\max(f, 2))\rceil - 1$ bands (here, the number of frames $f = 12$) based on the temporal distance between tokens. The central band has full compute density, while each successive outer band has half the density of the previous one. Except for band $\pm1$, each band also doubles the diagonal width of its predecessor. (b) The corresponding attention mask for (a). The compute density is reflected in the compute diagonal width of each frame-to-frame block. When the diagonal width drops below 1, we reduce the frequency of diagonals. We additionally add an attention sink. (c) An example mask used in HunyuanVideo, illustrating the final sparsity pattern in practice.
  • ...and 9 more figures
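
The band structure described in the Figure 5 caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the released implementation: the band count follows the caption's formula, but the band assignment (signed $\lfloor \log_2 |d| \rfloor$ of the frame offset $d$), the window rule, the stride used once the width drops below 1, and the choice of the first frame as the attention sink are all assumptions made here. The names `num_bands`, `band_index`, and `radial_mask` are hypothetical.

```python
import numpy as np


def num_bands(f: int) -> int:
    """Band count from the Figure 5 caption: 2 * ceil(log2(max(f, 2))) - 1."""
    n = max(f, 2)
    # (n - 1).bit_length() equals ceil(log2(n)) for integers n >= 1.
    return 2 * (n - 1).bit_length() - 1


def band_index(d: int) -> int:
    """Assumed band assignment: signed floor(log2(|d|)); offsets -1, 0, 1 share band 0."""
    if d == 0:
        return 0
    b = abs(d).bit_length() - 1  # floor(log2(|d|)) without floating point
    return b if d > 0 else -b


def radial_mask(f: int, s: int, sink: bool = False) -> np.ndarray:
    """Boolean (f*s, f*s) mask for f frames of s tokens; True marks a computed pair."""
    mask = np.zeros((f * s, f * s), dtype=bool)
    k = np.arange(s)
    spatial_dist = np.abs(k[:, None] - k[None, :])
    for i in range(f):          # query frame
        for j in range(f):      # key frame
            scale = 2 ** abs(band_index(j - i))
            if scale <= s:
                # Window width s / scale: halves each time the band index grows by one.
                block = (spatial_dist + 1) * scale <= s
            else:
                # Width below 1: keep only the diagonal, and only on every stride-th frame.
                stride = -(-scale // s)  # ceil(scale / s)
                block = (spatial_dist == 0) & (abs(j - i) % stride == 0)
            mask[i * s:(i + 1) * s, j * s:(j + 1) * s] = block
    if sink:
        # Assumption: the attention sink is the first frame, visible to every query.
        mask[:, :s] = True
    return mask
```

For example, `radial_mask(12, 4).mean()` gives the fraction of query-key pairs that would be computed for 12 frames of 4 tokens each, and `num_bands(12)` reproduces the band count for the caption's $f = 12$ case. A dense Python loop like this is only practical for small toy sizes; it is meant to make the pattern inspectable, not to be used at video scale.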