Bidirectional Sparse Attention for Faster Video Diffusion Training

Chenlu Zhan, Wen Li, Chuyu Shen, Jun Zhang, Suhui Wu, Hao Zhang

TL;DR

This work tackles the heavy computational cost of long-sequence video diffusion transformers by introducing Bidirectional Sparse Attention (BSA), a trainable framework that sparsifies both queries and key-value (KV) pairs in 3D full attention. It combines a 3D block partitioning strategy, a query-sparsity method that preserves informative tokens via center-token cosine similarity (with windowing), and a dynamic, statistics-based KV-sparsity mechanism that adaptively selects salient KV blocks per query. The approach achieves up to 20× FLOP reduction and up to 17.79× faster attention during training and, applied training-free at inference, up to 6.2× speedups, while maintaining or surpassing full-attention generation quality. These results point to substantial practical impact for efficient pretraining and deployment of long-duration, high-resolution video diffusion models.
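As a rough sanity check on how sparsifying both sides compounds, the attention cost can be written as a product of the two keep ratios. This is a back-of-the-envelope sketch, not the paper's own accounting: the symbols $N$ (sequence length), $d$ (head dimension), $\rho_q$ and $\rho_{kv}$ (fractions of queries and KV tokens kept) are assumptions introduced here, the overhead of block selection and softmax is ignored, and the example ratios are illustrative values rather than settings reported by the authors.

```latex
\begin{aligned}
\text{FLOPs}_{\text{full}}   &\approx 4 N^2 d
  && \text{($2N^2d$ for $QK^\top$, $2N^2d$ for the weighted sum over $V$)} \\
\text{FLOPs}_{\text{sparse}} &\approx 4\,(\rho_q N)(\rho_{kv} N)\,d
  = \rho_q\,\rho_{kv}\,\text{FLOPs}_{\text{full}} \\
\text{reduction}             &= \frac{1}{\rho_q\,\rho_{kv}},
  \qquad \text{e.g. } \rho_q = 0.5,\ \rho_{kv} = 0.1 \;\Rightarrow\; 20\times
\end{aligned}
```

Under these assumptions neither side needs to be extremely sparse on its own for the combined reduction to be large, which is the motivation for sparsifying in both directions.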

Abstract

Video diffusion Transformer (DiT) models excel in generative quality but hit major computational bottlenecks when producing high-resolution, long-duration videos. The quadratic complexity of full attention leads to prohibitively high training and inference costs. The inefficiency of full attention stems from two key challenges: excessive computation due to the inherent sparsity of Queries and Key-Value pairs, and redundant computation as fixed sparse patterns fail to leverage DiT's dynamic attention. To overcome these limitations, we propose a Bidirectional Sparse Attention (BSA) framework for faster video DiT training, the first to dynamically sparsify both Queries and Key-Value pairs within 3D full attention, thereby substantially improving training and inference efficiency. BSA addresses these issues through two key components. Query sparsity is optimized by selecting the most informative query tokens via semantic similarity, together with a dynamic spatial-time training strategy, while KV sparsity is achieved by computing a statistical dynamic threshold to retain only the most salient KV blocks for computation. Extensive experiments demonstrate that BSA significantly accelerates DiT training across long sequences, reducing FLOPs by up to 20× and achieving 17.79× faster attention training, while preserving or even surpassing the generative quality of full attention.
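The query-sparsity idea described above can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `sparsify_query_blocks`, the `prune_ratio` parameter, the choice of the middle index of each flattened block as its "center token", and the rule that tokens most similar to the center are treated as redundant are all hypothetical choices made here. The windowing and the dynamic spatial-time training strategy are not modeled.

```python
import numpy as np

def sparsify_query_blocks(q_blocks, prune_ratio=0.5):
    """Keep the least redundant query tokens in each block.

    q_blocks: float array of shape (num_blocks, block_size, dim).
    Returns (kept_tokens, kept_indices).
    Assumption: a token is "redundant" when its cosine similarity to the
    block's center token is high; the center token itself is always kept.
    """
    num_blocks, block_size, _ = q_blocks.shape
    center = block_size // 2  # assumed center of the flattened block

    # Cosine similarity of every token to its block's center token.
    norms = np.linalg.norm(q_blocks, axis=-1, keepdims=True)
    unit = q_blocks / np.maximum(norms, 1e-8)
    sim = np.einsum("bnd,bd->bn", unit, unit[:, center])

    # Force the center token to rank as least redundant so it survives.
    sim[:, center] = -np.inf

    # Prune a fixed fraction of tokens, keeping those with lowest similarity.
    num_keep = max(1, block_size - int(round(prune_ratio * block_size)))
    keep_idx = np.argsort(sim, axis=-1, kind="stable")[:, :num_keep]
    keep_idx = np.sort(keep_idx, axis=-1)  # restore original token order

    kept = np.take_along_axis(q_blocks, keep_idx[..., None], axis=1)
    return kept, keep_idx
```

Only the retained queries would then enter the attention computation, so the cost on the query side shrinks roughly in proportion to the fraction kept.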

Paper Structure

This paper contains 17 sections, 5 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Performance comparison between our sparse attention and full attention. (a) Speedup ratio and computational cost. (b) Generation quality across four consistency metrics on VBench (Huang et al., 2024).
  • Figure 2: Distribution of important queries in attention computation. In full attention, the attention maps across different frames are highly similar, indicating the presence of redundant queries that lead to repetitive computations. In contrast, our Query-Sparse attention produces distinct attention maps that focus on salient content such as human actions, rather than static backgrounds. This demonstrates the method’s ability to prune redundant features while preserving essential semantics.
  • Figure 3: Overview of BSA. We introduce a Bidirectional Attention Sparsification that exploits the dynamic sparsity of both Queries and Key-Value (KV) pairs. The $QKV$ sequences of the video are first partitioned into blocks to efficiently select critical tokens (see the block partition section). (a) We then select each query block’s center token, linearly score within-block tokens by semantic similarity to the center, and prune a fixed fraction of redundant tokens to retain only the most informative queries (see the Sparse Query section). (b) For KV sparsity, we dynamically pick the most relevant KV blocks for each query block and prune the unrelated KV blocks. We compute sparsity-adaptive thresholds and iteratively admit tokens until a cumulative score target is met (see the Sparse KV section).
  • Figure 4: Comparison curves of (a) training loss and (b) validation loss for Sparse Attention and Full Attention.
  • Figure 5: Qualitative comparison of text-to-video generation results between full attention and BSA across 4 different sequence lengths.
  • ...and 3 more figures
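
The KV-side selection outlined in the Figure 3 caption (score KV blocks per query block, then admit them until a cumulative score target is met) might look roughly like the following NumPy sketch. Several pieces are assumptions made for illustration rather than details confirmed by the source: the function name `select_kv_blocks`, mean pooling as the block summary, softmax-normalized block scores, and a single fixed `target` mass in place of the paper's sparsity-adaptive statistical threshold.

```python
import numpy as np

def select_kv_blocks(q_blocks, k_blocks, target=0.9):
    """Return a boolean mask (num_q_blocks, num_kv_blocks) of admitted KV blocks.

    q_blocks: (num_q_blocks, block_size, dim)
    k_blocks: (num_kv_blocks, block_size, dim)
    Assumption: blocks are summarized by mean pooling, and blocks are
    admitted in descending score order until `target` mass is covered.
    """
    q_mean = q_blocks.mean(axis=1)
    k_mean = k_blocks.mean(axis=1)

    # Block-level attention scores, softmax-normalized per query block.
    scores = q_mean @ k_mean.T / np.sqrt(q_mean.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Visit KV blocks from most to least relevant for each query block.
    order = np.argsort(-probs, axis=-1, kind="stable")
    sorted_probs = np.take_along_axis(probs, order, axis=-1)

    # Admit a block while the mass accumulated before it is below the target,
    # so the top block is always kept whenever target > 0.
    mass_before = np.cumsum(sorted_probs, axis=-1) - sorted_probs
    keep_sorted = mass_before < target

    # Scatter the decisions back to the original KV block positions.
    mask = np.zeros_like(keep_sorted)
    np.put_along_axis(mask, order, keep_sorted, axis=-1)
    return mask
```

Because the number of admitted blocks depends on how concentrated each query block's scores are, peaked rows keep very few KV blocks while flat rows keep more, which is the behavior a fixed sparse pattern cannot provide.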