Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention

Chengtao Lv, Yumeng Shi, Yushi Huang, Ruihao Gong, Shen Ren, Wenya Wang

TL;DR

This paper tackles the quadratic cost of spatiotemporal attention in autoregressive (AR) video diffusion by introducing Light Forcing, a sparse-attention framework specifically designed for AR generation. It presents Chunk-Aware Growth (CAG) to allocate sparsity across chunks based on estimated contribution to global error, and Hierarchical Sparse Attention (HSA) to capture both long-range and local historical context through a coarse-to-fine frame-and-block masking strategy. Together, CAG and HSA deliver improved generation quality and fixed-complexity attention, enabling real-time video synthesis (e.g., 19.7 FPS on a consumer RTX 5090) when paired with FP8 quantization and LightVAE. The approach outperforms state-of-the-art sparse-attention baselines on VBench across Self Forcing and LongLive, demonstrating practical impact for scalable AR video diffusion and interactive applications.

Abstract

Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attention solutions have shown promise on bidirectional models, we identify that applying these solutions to AR models leads to considerable performance degradation for two reasons: isolated consideration of chunk generation and insufficient utilization of past informative context. Motivated by these observations, we propose Light Forcing, the first sparse attention solution tailored for AR video generation models. It incorporates a Chunk-Aware Growth mechanism that quantitatively estimates the contribution of each chunk, which determines its sparsity allocation. This progressive sparsity increase strategy enables the current chunk to inherit prior knowledge from earlier chunks during generation. Additionally, we introduce Hierarchical Sparse Attention to capture informative historical and local context in a coarse-to-fine manner. This two-level mask selection strategy (i.e., frame and block level) can adaptively handle diverse attention patterns. Extensive experiments demonstrate that our method outperforms existing sparse attention methods in quality (e.g., 84.5 on VBench) and efficiency (e.g., 1.2× to 1.3× end-to-end speedup). Combined with FP8 quantization and LightVAE, Light Forcing further achieves a 2.3× speedup and 19.7 FPS on an RTX 5090 GPU. Code will be released at https://github.com/chengtao-lv/LightForcing.
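To make the coarse-to-fine idea concrete, the following is a minimal sketch of a two-level (frame, then block) selection for a single query block. It is not the authors' implementation: the function name select_blocks, the parameters frames_kept and blocks_kept, the mean-pooled frame score, and the dot-product top-k rule are all assumptions chosen for illustration, and the abstract does not specify how the paper scores frames or blocks.

```python
from typing import List, Set, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors: List[Vector]) -> Vector:
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def top_k(scores: List[float], k: int) -> List[int]:
    """Indices of the k largest scores (ties go to the lower index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])


def select_blocks(
    query: Vector,
    past_keys: List[List[Vector]],  # past_keys[frame][block]: pooled key vector (assumed layout)
    frames_kept: int,               # hypothetical budget for the coarse level
    blocks_kept: int,               # hypothetical budget for the fine level
) -> Set[Tuple[int, int]]:
    # Coarse level: score each past frame by its mean key and keep the best frames.
    frame_scores = [dot(query, mean_vector(blocks)) for blocks in past_keys]
    chosen_frames = top_k(frame_scores, min(frames_kept, len(past_keys)))

    # Fine level: inside each kept frame, keep only the best-matching blocks.
    selected: Set[Tuple[int, int]] = set()
    for f in chosen_frames:
        block_scores = [dot(query, key) for key in past_keys[f]]
        for b in top_k(block_scores, min(blocks_kept, len(block_scores))):
            selected.add((f, b))
    return selected


# Toy history: 3 past frames, 2 blocks per frame, 2-dimensional pooled keys.
past = [
    [[1.0, 0.0], [0.0, 1.0]],  # frame 0
    [[0.0, 1.0], [0.0, 1.0]],  # frame 1
    [[1.0, 0.0], [1.0, 0.0]],  # frame 2
]
mask = select_blocks([1.0, 0.0], past, frames_kept=2, blocks_kept=1)
print(sorted(mask))  # [(0, 0), (2, 0)]
```

In this sketch the number of attended key blocks is at most frames_kept × blocks_kept however long the history grows, which is one simple way to obtain the fixed attention cost mentioned in the TL;DR.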

Paper Structure (19 sections, 22 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Runtime comparison of attention versus other components across chunk indices for Self Forcing [huang2025self] 1.3B on an RTX 5090. When the chunk index reaches 14, attention accounts for approximately 75% of the total latency.
  • Figure 2: Comparison of different visual generation examples (i.e., 7 chunks for 21 latent frames), where blue, red, and green boxes denote attention sparsity rates of 0%, 80%, and 90%, respectively.
  • Figure 3: Overview of Light Forcing. The left subfigure illustrates our Chunk-Aware Growth (Sec. \ref{sec:method1}) strategy for sparsity allocation across different chunks. The right subfigure demonstrates how Hierarchical Sparse Attention (Sec. \ref{sec:method2}) is utilized to efficiently retrieve long-range historical context. Note that a chunk corresponds to a group of frames processed in a single generation step (e.g., 3 frames in practice). For simplicity, we visualize each chunk as a single frame in the overview.
  • Figure 4: Visualization of attention logits between query blocks at chunk 7 (i.e., frames 18-20, 24 blocks per frame) and all past key frames (i.e., frames 0-17) on Self Forcing [huang2025self].
  • Figure 5: Qualitative comparisons of 5-second videos generated under the prompt "A cute raccoon playing guitar in a boat on the ocean" on Self Forcing [huang2025self]. We select frames at 0s, 2s, and 5s as representative snapshots of the video.
  • ...and 2 more figures
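
The chunk and frame indices quoted in the captions above follow from simple arithmetic, sketched below. The helper chunk_frames is hypothetical, and the indexing convention (1-based chunk indices, 0-based latent frame indices) is an assumption inferred from the captions of Figures 2 and 4, not something the captions state outright. The block counts are derived from the captions' "24 blocks per frame" under that same assumption.

```python
FRAMES_PER_CHUNK = 3   # "3 frames in practice" (Figure 3 caption)
BLOCKS_PER_FRAME = 24  # Figure 4 caption


def chunk_frames(chunk_index: int, frames_per_chunk: int = FRAMES_PER_CHUNK) -> range:
    """0-based latent frame indices covered by a 1-based chunk index (assumed convention)."""
    start = (chunk_index - 1) * frames_per_chunk
    return range(start, start + frames_per_chunk)


current = chunk_frames(7)            # query frames for chunk 7
history = range(0, current.start)    # all earlier frames act as keys

print(list(current))                     # [18, 19, 20]
print(len(history))                      # 18
print(len(current) * BLOCKS_PER_FRAME)   # 72 query blocks
print(len(history) * BLOCKS_PER_FRAME)   # 432 past key blocks
```

Under this convention, 7 chunks cover 21 latent frames, matching the Figure 2 caption.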