Training-Free Sparse Attention for Fast Video Generation via Offline Layer-Wise Sparsity Profiling and Online Bidirectional Co-Clustering

Jiayi Luo, Jiayu Chen, Jiankun Wang, Cong Wang, Hanxin Zhu, Qingyun Sun, Chen Gao, Zhibo Chen, Jianxin Li

Abstract

Diffusion Transformers (DiTs) achieve strong video generation quality but suffer from high inference cost due to dense 3D attention, which has motivated sparse attention techniques to improve efficiency. However, existing training-free sparse attention methods for video generation still face two unresolved limitations: they ignore layer heterogeneity in attention pruning, and they ignore query-key coupling in block partitioning. Both hinder a better quality-speedup trade-off. In this work, we uncover a critical insight: the attention sparsity of each layer is an intrinsic property of that layer and varies only slightly across inputs. Motivated by this, we propose SVOO, a training-free Sparse attention framework for fast Video generation via Offline layer-wise sparsity profiling and Online bidirectional co-clustering. Specifically, SVOO adopts a two-stage paradigm: (i) offline layer-wise sensitivity profiling to derive intrinsic per-layer pruning levels, and (ii) online block-wise sparse attention via a novel bidirectional co-clustering algorithm. Extensive experiments on seven widely used video generation models demonstrate that SVOO achieves a superior quality-speedup trade-off over state-of-the-art methods, delivering up to $1.93\times$ speedup while maintaining a PSNR of up to 29 dB on Wan2.1.
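To make the online stage of the two-stage paradigm more concrete, the following is a minimal sketch of block-pair selection under a per-layer budget. It is not the authors' bidirectional co-clustering algorithm, which is not described in this excerpt. It assumes that queries and keys have already been grouped into blocks, that each block is summarized by a centroid vector, and that the offline stage has produced a hypothetical per-layer `keep_ratio`. The function name `select_block_pairs` and the centroid dot-product score are illustrative assumptions.

```python
import math


def select_block_pairs(q_centroids, k_centroids, keep_ratio):
    """For each query block, return the sorted indices of key blocks to keep.

    Assumptions (not taken from the paper): blocks are summarized by
    centroids, block saliency is the centroid dot product, and keep_ratio
    is the fraction of key blocks retained for this layer.
    """
    n_keep = max(1, math.ceil(keep_ratio * len(k_centroids)))
    selected = []
    for q in q_centroids:
        # Score every key block against this query block.
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_centroids]
        # Rank key blocks from most to least salient.
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        selected.append(sorted(order[:n_keep]))
    return selected


# Usage: two query blocks, four key blocks, keep half of the key blocks.
pairs = select_block_pairs(
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, -1.0]],
    keep_ratio=0.5,
)
```

A layer profiled as sparser would receive a smaller `keep_ratio`, so attention is computed over fewer block pairs for that layer.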

Paper Structure (41 sections, 2 theorems, 73 equations, 14 figures, 4 tables, 1 algorithm)

Key Result

Theorem 4.2

Consider a well-trained transformer layer and denote by $V(\mathbf{X})$ the average row-wise variance of the pre-softmax attention logits produced by this layer for input $\mathbf{X}\in\mathbb{R}^{n\times d}$. Under Assumption assump:mean_bound, for any two independent inputs $\mathbf{X},\hat{\mathbf{X}}$ […] where $C>0$ is an absolute constant, $\mathbf{M}\triangleq \mathbf{W}_{\mathrm{Q}}$ […]
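The quantity $V(\mathbf{X})$ can be illustrated with a small sketch. It assumes the logits are formed as $(\mathbf{X}\mathbf{W}_{\mathrm{Q}})(\mathbf{X}\mathbf{W}_{\mathrm{K}})^{\top}$, that "variance" means the population variance over each row, and that the optional $1/\sqrt{d_k}$ scaling follows standard attention practice. The key projection, the scaling, and the helper names are assumptions, since the excerpt above is truncated.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def avg_rowwise_logit_variance(x, w_q, w_k, scale=True):
    """Average, over query rows, of the variance of pre-softmax logits.

    x: n x d input; w_q, w_k: d x d_k projections (w_k is assumed here).
    """
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    logits = matmul(q, k_t)               # n x n pre-softmax logits
    if scale:
        s = 1.0 / math.sqrt(len(w_q[0]))  # assumed 1/sqrt(d_k) scaling
        logits = [[v * s for v in row] for row in logits]
    variances = []
    for row in logits:
        mean = sum(row) / len(row)
        variances.append(sum((v - mean) ** 2 for v in row) / len(row))
    return sum(variances) / len(variances)


# Usage: identity input and identity projections give identity logits.
eye = [[1.0, 0.0], [0.0, 1.0]]
v_unscaled = avg_rowwise_logit_variance(eye, eye, eye, scale=False)
```

Comparing this value for two independent inputs passed through the same layer gives an empirical handle on the stability that the theorem describes.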

Figures (14)

  • Figure 1: An example acceleration comparison on the Wan2.1-T2V-1.3B model [wan2025wan] against SVG2 [yang2025sparse], a state-of-the-art training-free sparse attention method. Our proposed SVOO achieves higher inference speedup (1.96$\times$ vs. 1.75$\times$) while still maintaining high generation quality in practice. All experiments are conducted on a single NVIDIA H200 GPU at 720p resolution (720$\times$1280) with 81 frames.
  • Figure 2: Layer-wise attention sparsity across different models. The figure shows that attention density varies substantially across layers (layer-wise heterogeneity), while remaining highly stable for each layer across different inputs (layer-wise stability).
  • Figure 3: Illustration of query-key coupling in block partitioning, which is important for block-wise sparse attention. The example shows that the optimal partitioning of Keys is Query-dependent, as different Queries induce different optimal groupings of Keys.
  • Figure 4: The framework of SVOO consists of two stages for accelerating video generation. Offline stage (left): we profile the intrinsic attention sparsity of each transformer layer and derive a layer-wise sparsity schedule. Online stage (right): we perform bidirectional co-clustering to partition queries and keys into coupled blocks, and then select salient block pairs according to the offline schedule.
  • Figure 5: Examples of videos generated by our proposed SVOO and dense attention on Wan and HunyuanVideo models.
  • ...and 9 more figures
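
The attention density that Figure 2 reports per layer is not defined in this excerpt. As a rough sketch, one common way to measure it is the smallest fraction of keys whose softmax weights cover a target share of each query's attention mass, averaged over queries. The definition, the `mass` parameter, and the function name `attention_density` are assumptions for illustration, not the paper's profiling procedure.

```python
import math


def attention_density(logit_rows, mass=0.9):
    """Average fraction of keys needed to cover `mass` of each softmax row.

    Assumed definition: lower values mean sparser attention.
    """
    fractions = []
    for row in logit_rows:
        # Numerically stable softmax over one query's logits.
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        weights = sorted((e / total for e in exps), reverse=True)
        # Count the largest weights until the target mass is reached.
        covered, count = 0.0, 0
        for w in weights:
            covered += w
            count += 1
            if covered >= mass:
                break
        fractions.append(count / len(row))
    return sum(fractions) / len(fractions)


# Usage: one uniform row (dense) and one sharply peaked row (sparse).
density = attention_density([[0.0, 0.0, 0.0, 0.0], [100.0, 0.0, 0.0, 0.0]], mass=0.5)
```

Evaluating such a measure for each layer over several prompts would show the two properties the caption names: values that differ across layers and stay close across inputs.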

Theorems & Definitions (3)

  • Theorem 4.2: Layer-wise Sparsity Stability
  • Theorem 1.1: Layer-wise Sparsity Stability
  • proof