Bidirectional Sparse Attention for Faster Video Diffusion Training
Chenlu Zhan, Wen Li, Chuyu Shen, Jun Zhang, Suhui Wu, Hao Zhang
TL;DR
This work tackles the heavy computational burden of long-sequence video diffusion transformers by introducing Bidirectional Sparse Attention (BSA), a trainable framework that sparsifies both Queries and Key-Value (KV) pairs in 3D full attention. It combines a 3D block partitioning strategy, a query-sparsity method that preserves informative tokens via center-token cosine similarity (with windowing), and a dynamic, statistics-based KV-sparsity mechanism that adaptively selects salient KV blocks per query. The approach achieves up to 20× FLOP reduction and up to 17.79× faster attention during training, and a training-free application at inference yields speedups of up to 6.2×, while maintaining or surpassing full-attention generation quality. These results point to substantial practical impact for efficient pretraining and deployment of long-duration, high-resolution video diffusion models.
Abstract
Video diffusion Transformer (DiT) models excel in generative quality but hit major computational bottlenecks when producing high-resolution, long-duration videos. The quadratic complexity of full attention leads to prohibitively high training and inference costs. The inefficiency of full attention stems from two key challenges: excessive computation due to the inherent sparsity of Queries and Key-Value (KV) pairs, and redundant computation as fixed sparse patterns fail to leverage DiT's dynamic attention. To overcome these challenges, we propose a Bidirectional Sparse Attention (BSA) framework for faster video DiT training, the first to dynamically sparsify both Queries and Key-Value pairs within 3D full attention, thereby substantially improving training and inference efficiency. BSA addresses the two challenges with two key components. Query sparsity is achieved by selecting the most informative query tokens via semantic similarity, combined with a dynamic spatial-time training strategy. KV sparsity is achieved by computing a statistical dynamic threshold and retaining only the most salient KV blocks for computation. Extensive experiments demonstrate that BSA significantly accelerates DiT training across long sequences, reducing FLOPs by up to 20× and achieving up to 17.79× faster attention during training, while preserving or even surpassing the generative quality of full attention.
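The two selection steps described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: that the query tokens least similar to a block's center token are the ones treated as most informative, that each KV block is summarized by a single vector, and that the statistical threshold takes the form mean + alpha * standard deviation of the block scores. The function names `select_queries` and `select_kv_blocks` and the parameters `keep_ratio` and `alpha` are hypothetical.

```python
import math
import statistics


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def select_queries(block, keep_ratio=0.5):
    """Return sorted indices of the query tokens kept in one block.

    Assumption (not from the paper): tokens least similar to the block's
    center token are the most informative; the center token is always kept.
    """
    center = len(block) // 2
    n_keep = max(1, round(keep_ratio * len(block)))
    others = [i for i in range(len(block)) if i != center]
    # Ascending similarity: the most distinctive tokens come first.
    others.sort(key=lambda i: cosine(block[i], block[center]))
    return sorted([center] + others[: n_keep - 1])


def select_kv_blocks(query_summary, key_summaries, alpha=0.0):
    """Return indices of KV blocks scoring above mean + alpha * std.

    Assumption: each KV block is represented by one summary vector and
    scored by its dot product with the query block's summary vector.
    """
    scores = [sum(q * k for q, k in zip(query_summary, ks)) for ks in key_summaries]
    threshold = statistics.fmean(scores) + alpha * statistics.pstdev(scores)
    kept = [i for i, s in enumerate(scores) if s > threshold]
    # Fall back to the single best block so no query is left without keys.
    return kept or [max(range(len(scores)), key=scores.__getitem__)]


# Toy data: one block of four 2-D query tokens and four KV block summaries.
query_block = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.0], [0.0, 1.0]]
kept_queries = select_queries(query_block, keep_ratio=0.5)

kv_summaries = [[0.9, 0.0], [0.1, 0.0], [0.8, 0.0], [0.0, 1.0]]
kept_kv = select_kv_blocks([1.0, 0.0], kv_summaries, alpha=0.0)
```

Because the threshold is derived from each query block's own score distribution, the number of retained KV blocks varies per query instead of being fixed in advance, which is the property a fixed sparse pattern lacks.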
