DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance

Xuan Shen, Chenxia Han, Yufa Zhou, Yanyue Xie, Yifan Gong, Quanyi Wang, Yiwei Wang, Yanzhi Wang, Pu Zhao, Jiuxiang Gu

TL;DR

Diffusion Transformer-based video models deliver high-quality generation, but attention dominates their compute cost. DraftAttention is a training-free, two-stage approach: a low-resolution draft attention map guides sparse attention at full resolution, and a hardware-friendly token reordering makes the resulting sparsity pattern efficient to compute in blocks. The method comes with theoretical bounds on the approximation error and demonstrates up to 1.75x end-to-end GPU acceleration while preserving video quality across multiple models and resolutions. It is plug-and-play and complementary to other acceleration strategies, with potential for further gains via quantization.

Abstract

Diffusion transformer-based video generation models (DiTs) have recently attracted widespread attention for their excellent generation quality. However, their computational cost remains a major bottleneck: attention alone accounts for over 80% of total latency, and generating just 8 seconds of 720p video takes tens of minutes, which poses serious challenges to practical application and scalability. To address this, we propose DraftAttention, a training-free framework for accelerating video diffusion transformers with dynamic sparse attention on GPUs. We apply down-sampling to each feature map across frames in the compressed latent space, enabling a higher-level receptive field over a latent composed of hundreds of thousands of tokens. The low-resolution draft attention map, derived from the draft query and key, exposes redundancy both spatially within each feature map and temporally across frames. We reorder the query, key, and value based on the draft attention map to guide the sparse attention computation at full resolution, and restore their original order after the attention computation. This reordering enables structured sparsity that aligns with hardware-optimized execution. Our theoretical analysis demonstrates that the low-resolution draft attention closely approximates the full attention, providing reliable guidance for constructing accurate sparse attention. Experimental results show that our method outperforms existing sparse attention approaches in video generation quality and achieves up to 1.75x end-to-end speedup on GPUs. Code: https://github.com/shawnricecake/draft-attention
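The two-stage idea in the abstract can be illustrated with a small, pure-Python sketch. It is not the authors' implementation: it pools consecutive groups of tokens in a 1D sequence rather than spatial patches of each latent frame, and the helper names (`average_pool`, `draft_attention`, `block_mask`) and the top-k selection rule are assumptions made for illustration only.

```python
import math

def average_pool(rows, group):
    """Average each run of `group` consecutive token vectors into one draft vector."""
    pooled = []
    for start in range(0, len(rows), group):
        block = rows[start:start + group]
        dim = len(block[0])
        pooled.append([sum(v[j] for v in block) / len(block) for j in range(dim)])
    return pooled

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def draft_attention(q, k, group):
    """Low-resolution attention map computed from pooled queries and keys."""
    dq, dk = average_pool(q, group), average_pool(k, group)
    scale = 1.0 / math.sqrt(len(q[0]))  # standard attention scaling (assumed here)
    return [softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in dk])
            for qi in dq]

def block_mask(draft, keep):
    """For each query block, keep the `keep` key blocks with the highest draft score.

    Top-k per row is an assumed selection rule, used only to show how a draft
    map can be turned into a block-sparse mask.
    """
    mask = []
    for row in draft:
        top = set(sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:keep])
        mask.append([j in top for j in range(len(row))])
    return mask

# Toy input: 8 tokens of dimension 2, pooled in groups of 2 -> a 4x4 draft map.
q = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0],
     [1.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [-1.0, 0.0]]
k = [row[:] for row in q]

draft = draft_attention(q, k, group=2)
mask = block_mask(draft, keep=2)
```

Each `True` entry in `mask` marks a (query block, key block) pair that the full-resolution sparse attention would compute; all other pairs are skipped.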

Paper Structure

This paper contains 26 sections, 2 theorems, 17 equations, 8 figures, 1 table, 2 algorithms.

Key Result

Theorem 3.3

If all regions have equal size $|R_i| = n/g$, then the Frobenius-norm error between the full and draft logit matrices is bounded by:

Figures (8)

  • Figure 1: FLOPs breakdown for 720p video generation with Hunyuan Video.
  • Figure 2: Overall DraftAttention pipeline. Both the query and key are reshaped into sequences of feature maps across frames, then downsampled via average pooling to produce the low-resolution draft query and draft key. Draft attention is computed using the flattened draft query and key. The full-resolution query and key are then reordered to align with the guidance from the draft attention.
  • Figure 3: Illustration of why the reordering is necessary. An entry "$xy$" in the attention map denotes the attention score between query token $x$ and key token $y$. Grouping the sparse pattern yields a hardware-friendly layout, leading to faster attention computation.
  • Figure 4: Latency measured at 768p on an H100 GPU for different attention sparsity ratios.
  • Figure 5: Visual comparison of our method and SVG (Xi et al., 2025) at a 90% attention sparsity ratio.
  • ...and 3 more figures
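
The reordering described in the captions for Figures 2 and 3 can be sketched as a permutation that makes the tokens of each pooled region contiguous, together with its inverse for restoring the original order. This is a minimal illustration for a single row-major frame; the function names and the 2x2 pooling window are assumptions, not details taken from the paper.

```python
def region_permutation(height, width, ph, pw):
    """Token indices of one row-major frame, listed region by region.

    Each region is a ph x pw patch, i.e. the set of tokens that average
    pooling would merge into a single draft token.
    """
    order = []
    for top in range(0, height, ph):
        for left in range(0, width, pw):
            for r in range(top, min(top + ph, height)):
                for c in range(left, min(left + pw, width)):
                    order.append(r * width + c)
    return order

def invert(perm):
    """Inverse permutation: maps an original position to its reordered position."""
    inverse = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inverse[old_pos] = new_pos
    return inverse

# A 4x4 frame pooled with a 2x2 window gives four regions of four tokens each.
perm = region_permutation(4, 4, 2, 2)
tokens = list(range(16))

reordered = [tokens[i] for i in perm]            # region-contiguous layout
restored = [reordered[i] for i in invert(perm)]  # original order recovered
```

After this permutation, every entry of the draft attention map corresponds to one contiguous block of the full-resolution attention matrix, which is what allows a block-sparse kernel to skip whole blocks.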

Theorems & Definitions (8)

  • Definition 3.1: Full Attention
  • Definition 3.2: Draft Attention via Average Pooling
  • Theorem 3.3: Draft Attention Error
  • Remark 3.4
  • Theorem 3.5: Sparsity Mask Error
  • Remark 3.6
  • proof
  • proof
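
The quantity that Theorem 3.3 concerns, the Frobenius-norm error between the full and draft logit matrices, can be measured empirically on toy data. The sketch below does not reproduce the theorem's bound. It assumes equal-size regions of consecutive tokens, assumes scaled dot-product logits, and compares each full logit with the draft logit of the region pair it belongs to; all function names are hypothetical.

```python
import math
import random

def logits(q, k):
    """Scaled dot-product logits (the 1/sqrt(d) scaling is an assumption)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    return [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]

def pool(rows, size):
    """Average each run of `size` consecutive vectors (equal-size regions)."""
    dim = len(rows[0])
    return [[sum(v[j] for v in rows[s:s + size]) / size for j in range(dim)]
            for s in range(0, len(rows), size)]

def frobenius_error(q, k, size):
    """Frobenius norm of (full logits - draft logits broadcast back to full size)."""
    full = logits(q, k)
    draft = logits(pool(q, size), pool(k, size))
    total = 0.0
    for i, row in enumerate(full):
        for j, s in enumerate(row):
            total += (s - draft[i // size][j // size]) ** 2
    return math.sqrt(total)

random.seed(0)
n, d = 8, 4
q = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
k = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

err_no_pooling = frobenius_error(q, k, size=1)  # regions of one token
err_pooled = frobenius_error(q, k, size=4)      # two regions of four tokens

# When every token inside a region is identical, pooling loses nothing.
flat = [[1.0, 2.0]] * 4
err_uniform = frobenius_error(flat, flat, size=4)
```

The error vanishes when pooling is a no-op (`size=1`) or when the tokens within each region are identical, which matches the intuition that the draft map is accurate wherever the latent is locally redundant.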