VORTA: Efficient Video Diffusion via Routing Sparse Attention

Wenhao Sun; Rong-Cheng Tu; Yifu Ding; Zhao Jin; Jingyi Liao; Shunyu Liu; Dacheng Tao

TL;DR

VORTA targets the cost of attention in video diffusion transformers, which scales as $\mathcal{O}(L^2 d)$ for sequence length $L$ and feature dimension $d$. It combines two sparse attention variants, sliding-window local attention and bucketed core-set long-range attention, with a signal-aware router that selects among them adaptively based on the diffusion timestep. The bucketed core-set selection, a core contribution, runs in linear time, $\mathcal{O}(L)$. Training uses a distillation-based objective and optimizes only the lightweight router while the base model stays frozen, which preserves pretrained performance and adds about 0.1% extra parameters. Across backbones and schedulers, VORTA achieves a $1.76\times$ end-to-end speedup on VBench, and up to $14.41\times$ when combined with other acceleration methods, while maintaining high VBench scores.

Abstract

Video diffusion transformers have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent acceleration methods improve efficiency by exploiting the local sparsity of attention scores, yet they often struggle to accelerate long-range computation. To address this problem, we propose VORTA, an acceleration framework with two novel components: 1) a sparse attention mechanism that efficiently captures long-range dependencies, and 2) a routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants. VORTA achieves an end-to-end speedup of $1.76\times$ without loss of quality on VBench. Furthermore, it integrates seamlessly with other acceleration methods, such as model caching and step distillation, reaching a speedup of up to $14.41\times$ with negligible performance degradation. VORTA thus makes video diffusion transformers more practical in real-world settings. Code and weights are available at https://github.com/wenhao728/VORTA.
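To make the quadratic-versus-local gap concrete, here is a rough sketch that counts multiply-adds for full attention and for a sliding-window variant in which each query attends to at most a fixed number of keys. It is a back-of-the-envelope cost model, not VORTA's implementation: the function names, the head dimension, and the window size are hypothetical values chosen for illustration, and the ratio it prints covers the attention operation only, so it should not be read as an end-to-end speedup.

```python
def full_attention_cost(seq_len: int, head_dim: int) -> int:
    """Approximate multiply-adds for QK^T plus the attention-weighted sum of V."""
    # Each of the seq_len queries interacts with all seq_len keys, twice
    # (once for the scores, once for the weighted sum over values).
    return 2 * seq_len * seq_len * head_dim


def sliding_window_cost(seq_len: int, head_dim: int, window: int) -> int:
    """Same count when each query attends to at most `window` keys."""
    keys_per_query = min(window, seq_len)
    return 2 * seq_len * keys_per_query * head_dim


if __name__ == "__main__":
    head_dim = 128  # hypothetical value, not taken from the paper
    window = 4096   # hypothetical window size, not taken from the paper
    for seq_len in (8_192, 32_768, 131_072):
        full = full_attention_cost(seq_len, head_dim)
        local = sliding_window_cost(seq_len, head_dim, window)
        print(f"L={seq_len:>7}: full / local = {full / local:.1f}x")
```

Under this model the full-attention cost grows with $L^2$ while the windowed cost grows with $L$, so the ratio between them grows linearly in the sequence length once $L$ exceeds the window.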
