VORTA: Efficient Video Diffusion via Routing Sparse Attention

Wenhao Sun, Rong-Cheng Tu, Yifu Ding, Zhao Jin, Jingyi Liao, Shunyu Liu, Dacheng Tao

TL;DR

VORTA tackles the heavy computation of video diffusion transformers, whose attention cost scales quadratically with sequence length, as $\mathcal{O}(L^2 d)$. It introduces a routing-based framework that combines two sparse attention variants, sliding-window local attention and bucketed core-set long-range attention, with a signal-aware router that adaptively selects among them based on the diffusion timestep. The core contribution is bucketed core-set selection, which achieves linear complexity $\mathcal{O}(L)$. Training uses a distillation-based objective and a lightweight routing optimization: the base model stays frozen, so pretrained performance is preserved, and the added parameters amount to roughly 0.1% of the model. Across backbones and schedulers, VORTA achieves a $1.76\times$ end-to-end speedup, and up to $14.41\times$ when combined with other acceleration methods, while maintaining high VBench scores, which makes it practical for real-world video generation.
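To make the gap between quadratic and linear cost concrete, the following minimal sketch counts the (query, key) pairs scored by full attention and by a 1D sliding-window attention. It is an illustration only, not the authors' implementation: the function names and the `radius` parameter are hypothetical, and VORTA itself operates on 3D video latents rather than a 1D sequence.

```python
def full_attention_pairs(seq_len: int) -> int:
    """Number of (query, key) pairs scored by full attention: L^2."""
    return seq_len * seq_len


def sliding_window_pairs(seq_len: int, radius: int) -> int:
    """Number of (query, key) pairs with |i - j| <= radius in a 1D sequence.

    `radius` is a hypothetical parameter for this sketch; for a fixed radius
    the count grows linearly in seq_len.
    """
    total = 0
    for i in range(seq_len):
        lo = max(0, i - radius)            # first key visible to query i
        hi = min(seq_len - 1, i + radius)  # last key visible to query i
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    for seq_len in (1_000, 2_000, 4_000):
        full = full_attention_pairs(seq_len)
        local = sliding_window_pairs(seq_len, radius=2)
        print(f"L={seq_len}: full={full}, sliding-window={local}")
```

Doubling `seq_len` quadruples the full-attention count but only roughly doubles the sliding-window count, which is the scaling gap that sparse attention exploits.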

Abstract

Video diffusion transformers have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent acceleration methods improve efficiency by exploiting the local sparsity of attention scores, yet they often struggle to accelerate long-range computation. To address this problem, we propose VORTA, an acceleration framework with two novel components: 1) a sparse attention mechanism that efficiently captures long-range dependencies, and 2) a routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants. VORTA achieves an end-to-end speedup of $1.76\times$ without loss of quality on VBench. Furthermore, it integrates seamlessly with other acceleration methods, such as model caching and step distillation, reaching a speedup of up to $14.41\times$ with negligible performance degradation. VORTA demonstrates its efficiency and enhances the practicality of video diffusion transformers in real-world settings. Code and weights are available at https://github.com/wenhao728/VORTA.

Paper Structure

This paper contains 53 sections, 7 equations, 16 figures, 5 tables, 1 algorithm.

Figures (16)

  • Figure 1: VORTA enables lossless acceleration of video diffusion transformers [Hunyuan, Wan], and remains compatible with other acceleration methods such as PAB [PAB] and PCD [PCD] for additional speedups.
  • Figure 2: Attention scores recalled by the nearest keys. (left) Attention scores are predominantly concentrated within a local neighborhood. (right) The locality is less pronounced at earlier sampling steps. Results are from the 20th (of 60) layer in HunyuanVideo [Hunyuan]. Only 8 out of the 24 attention heads are shown for clarity.
  • Figure 3: (left) Generation with the complete sampling process. (middle) Intermediate generation result. (right) Intermediate generation result with only core-set predictions.
  • Figure 4: Illustration of converting a sliding window mask into a sliding tile mask. A 1D attention mask is shown for simplicity, with both the window size and tile size set to 2.
  • Figure 5: Bucketed Core-set Selection (BCS). For clarity, the 2D image is used for illustration; the actual inputs and buckets operate on 3D video data within the latent space. In this example, the top $k=4$ tokens from each bucket are dropped.
  • ...and 11 more figures
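
The mask conversion named in the Figure 4 caption can be sketched in 1D as follows. This is a minimal illustration under stated assumptions, not the paper's kernel: it assumes a "covering" rule (a tile pair is kept whenever any token pair inside it falls within the sliding window), and the function names and the `radius` parameter are hypothetical.

```python
def sliding_window_mask(seq_len: int, radius: int) -> list[list[bool]]:
    """Token-level 1D mask: query i may attend to key j when |i - j| <= radius."""
    return [[abs(i - j) <= radius for j in range(seq_len)] for i in range(seq_len)]


def to_tile_mask(token_mask: list[list[bool]], tile: int) -> list[list[bool]]:
    """Coarsen a token-level mask to tile level.

    Assumed rule (not taken from the paper): keep a (query tile, key tile)
    pair if any token pair inside that block is allowed by the token mask.
    """
    n = len(token_mask)
    if n % tile != 0:
        raise ValueError("sequence length must be a multiple of the tile size")
    n_tiles = n // tile
    return [
        [
            any(
                token_mask[i][j]
                for i in range(a * tile, (a + 1) * tile)
                for j in range(b * tile, (b + 1) * tile)
            )
            for b in range(n_tiles)
        ]
        for a in range(n_tiles)
    ]


def expand_tile_mask(tile_mask: list[list[bool]], tile: int) -> list[list[bool]]:
    """Expand a tile-level mask back to token level (dense blocks)."""
    n = len(tile_mask) * tile
    return [[tile_mask[i // tile][j // tile] for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    token = sliding_window_mask(seq_len=6, radius=1)
    tiles = to_tile_mask(token, tile=2)
    for row in expand_tile_mask(tiles, tile=2):
        print("".join("#" if keep else "." for keep in row))
```

Under this rule the tile-level mask is a superset of the token-level window, so no locally attended pair is lost, and every kept block is dense, which is the property that makes block-wise attention kernels efficient.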