Training-free and Adaptive Sparse Attention for Efficient Long Video Generation

Yifei Xia, Suhan Ling, Fangcheng Fu, Yujie Wang, Huixia Li, Xuefeng Xiao, Bin Cui

TL;DR

The paper tackles the computational burden of attention in diffusion-based long video generation using Diffusion Transformers. It introduces AdaSpa, a training-free, plug-and-play sparse attention framework that combines a Dynamic Pattern (blockified sparsity) with Online Precise Search (LSE-cached, head-adaptive) to accelerate generation. Through comprehensive analysis of DiT sparsity, AdaSpa demonstrates consistent quality preservation while achieving notable speedups across multiple models and scales, including up to 4× speedups with longer videos. The approach requires no dataset-specific profiling or fine-tuning, offering a scalable solution for efficient long video synthesis in real-world settings.

Abstract

Generating high-fidelity long videos with Diffusion Transformers (DiTs) is often hindered by significant latency, primarily due to the computational demands of attention mechanisms. For instance, generating an 8-second 720p video (110K tokens) with HunyuanVideo takes about 600 PFLOPs, with around 500 PFLOPs consumed by attention computations. To address this issue, we propose AdaSpa, the first Dynamic Pattern and Online Precise Search sparse attention method. First, to realize the Dynamic Pattern, we introduce a blockified pattern to efficiently capture the hierarchical sparsity inherent in DiTs. This is based on our observation that the sparse characteristics of DiTs exhibit hierarchical and blockified structures between and within different modalities. This blockified approach significantly reduces the complexity of attention computation while maintaining high fidelity in the generated videos. Second, to enable Online Precise Search, we propose the Fused LSE-Cached Search with Head-adaptive Hierarchical Block Sparse Attention. This method is motivated by our finding that the sparse patterns and log-sum-exp (LSE) values of DiTs vary with respect to inputs, layers, and heads, but remain invariant across denoising steps. By leveraging this invariance across denoising steps, it adapts to the dynamic nature of DiTs and allows for precise, real-time identification of sparse indices with minimal overhead. AdaSpa is implemented as an adaptive, plug-and-play solution and can be integrated seamlessly with existing DiTs, requiring neither additional fine-tuning nor dataset-dependent profiling. Extensive experiments validate that AdaSpa delivers substantial acceleration across various models while preserving video quality, establishing itself as a robust and scalable approach to efficient video generation.
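
As a rough sanity check on the attention cost quoted above, the sketch below estimates the FLOPs of full attention over 110K tokens. The layer count (60), hidden size (3072), and number of denoising steps (50) are assumptions chosen for illustration and are not stated in the abstract; the function name `attention_flops` is hypothetical. Only the two attention matrix multiplications are counted, so projections, softmax, and MLP layers are ignored.

```python
def attention_flops(num_tokens: int, hidden_dim: int, num_layers: int, num_steps: int) -> float:
    """Approximate FLOPs of full self-attention (QK^T and PV only) over a sampling run."""
    # QK^T costs about 2 * N^2 * d FLOPs and PV costs another 2 * N^2 * d,
    # where d is the hidden size summed over all heads.
    per_layer = 4 * num_tokens ** 2 * hidden_dim
    return float(per_layer * num_layers * num_steps)


# Token count from the abstract; the remaining values are ASSUMED, not from the source.
NUM_TOKENS = 110_000
flops = attention_flops(NUM_TOKENS, hidden_dim=3072, num_layers=60, num_steps=50)
pflops = flops / 1e15
print(f"{pflops:.0f} PFLOPs")
```

Under these assumptions the estimate comes to roughly 446 PFLOPs, the same order of magnitude as the "around 500 PFLOPs" quoted above. Because the cost grows with the square of the token count, doubling the video length roughly quadruples the attention cost.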

Paper Structure

This paper contains 15 sections, 14 equations, 11 figures, 2 tables, 2 algorithms.

Figures (11)

  • Figure 1: Visual comparison of different sparse attention methods on HunyuanVideo [kong2024hunyuanvideo] and CogVideoX1.5-5B [yang2024cogvideox]. Our method, AdaSpa, consistently achieves the best performance and the best speedup, and its outputs remain almost identical to the original videos.
  • Figure 2: The total FLOPs required and the proportion of attention when generating 720p videos with different video lengths (16FPS).
  • Figure 3: Different types of sparse pattern recognition methods. (a) StreamingLLM: uses a static sink + sliding window pattern and needs no search or switching. (b) Sparse VideoGen: prepares two predefined static patterns and uses online switching to decide which one to apply. (c) MInference: prepares several dynamic patterns, first performs an offline search to determine the target pattern, then performs an online approximate search to find suboptimal sparse indices for that pattern. (d) AdaSpa: our method shows that the most suitable pattern for DiTs is the blockified pattern, and performs an online precise search to find the optimal sparse indices for it.
  • Figure 4: Different Attention Mechanisms in DiTs.
  • Figure 5: Typical attention weight maps from HunyuanVideo. Weight shows the visualized attention weights. Topk, Block, Col, Diag, and Diag+Col show the corresponding sparse patterns at a sparsity of 0.9. The far right shows an enlarged view of the bottom-right corner of the attention weights, of size $(2hw+t) \times (2hw+t)$, where a clear hierarchical structure between frames can be observed. There is also a distinct boundary between the text modality and the pure video modality, with varying degrees of text sink effect. (720p, 129 frames; the block pattern uses a block size of 32.)
  • ...and 6 more figures
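
The Block pattern named in the Figure 5 caption can be illustrated with a toy selection routine. This is a minimal sketch and not AdaSpa's Fused LSE-Cached Search: it assumes the full attention weights are already available, sums them per tile, and keeps the highest-mass tiles for a given sparsity. The function name `block_topk_mask` and the 4×4 example weights are hypothetical.

```python
from typing import List


def block_topk_mask(weights: List[List[float]], block: int, sparsity: float) -> List[List[bool]]:
    """Return a tile-level keep mask that retains the highest-mass tiles."""
    n = len(weights)
    assert n % block == 0, "sketch assumes the sequence length is a multiple of the block size"
    nb = n // block
    # Sum the attention weights that fall inside each (query block, key block) tile.
    mass = [[0.0] * nb for _ in range(nb)]
    for i in range(n):
        for j in range(n):
            mass[i // block][j // block] += weights[i][j]
    # Keep the top (1 - sparsity) fraction of tiles, and always at least one.
    keep = max(1, round((1.0 - sparsity) * nb * nb))
    ranked = sorted(((mass[r][c], r, c) for r in range(nb) for c in range(nb)), reverse=True)
    mask = [[False] * nb for _ in range(nb)]
    for _, r, c in ranked[:keep]:
        mask[r][c] = True
    return mask


# Toy row-normalized attention weights with most mass near the diagonal (made-up values).
toy_weights = [
    [0.70, 0.20, 0.05, 0.05],
    [0.30, 0.60, 0.05, 0.05],
    [0.05, 0.05, 0.50, 0.40],
    [0.10, 0.10, 0.20, 0.60],
]
mask = block_topk_mask(toy_weights, block=2, sparsity=0.5)
print(mask)
```

With a block size of 2 and a sparsity of 0.5, the two diagonal tiles carry most of the attention mass and are the ones kept. A real kernel would skip the masked tiles entirely rather than materialize the full weight matrix first, which is what makes the search step the hard part.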