Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile
Hangliang Ding, Dacheng Li, Runlong Su, Peiyuan Zhang, Zhijie Deng, Ion Stoica, Hao Zhang
TL;DR
Efficient-vDiT addresses the high inference cost of video diffusion transformers with 3D full attention by exploiting Attention Tile, a repetitive, diagonal-dominant pattern in 3D attention maps, to construct fixed, low-cost sparse attention whose complexity is linear in the number of video frames. It combines three stages: multi-step consistency distillation to shorten sampling, a layer-wise search to tailor a sparse mask to each layer, and knowledge distillation to recover fidelity in the sparse model. Together these yield speedups of up to about 7.8x with minimal quality loss on Open-Sora-Plan variants. The approach also adapts to distributed inference, achieving an additional speedup of up to 3.91x on 4 GPUs via sequence parallelism. These techniques enable efficient high-resolution video generation and remain compatible with existing diffusion frameworks while maintaining competitive perceptual quality.
Abstract
Despite the promise of synthesizing high-fidelity videos, Diffusion Transformers (DiTs) with 3D full attention suffer from expensive inference due to the complexity of attention computation and the large number of sampling steps. For example, the popular Open-Sora-Plan model takes more than 9 minutes to generate a single video of 29 frames. This paper addresses the inefficiency from two aspects: 1) Prune the 3D full attention based on the redundancy within video data: we identify a prevalent tile-style repetitive pattern in the 3D attention maps for video data, and advocate a new family of sparse 3D attention whose complexity is linear in the number of video frames. 2) Shorten the sampling process by adopting existing multi-step consistency distillation: we split the entire sampling trajectory into several segments and perform consistency distillation within each one to enable few-step generation. We further devise a three-stage training pipeline that combines the low-complexity attention with few-step generation. Notably, with 0.1% of the pretraining data, we turn the Open-Sora-Plan-1.2 model into an efficient one that is 7.4x to 7.8x faster for 720p video generation at 29 and 93 frames, with a marginal performance trade-off in VBench. In addition, we demonstrate that our approach is amenable to distributed inference, achieving an additional 3.91x speedup when running on 4 GPUs with sequence parallelism.
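To make the linear-complexity claim concrete, the following is a minimal sketch of one possible frame-level sparse mask, not the paper's exact mask family. It assumes each frame attends to itself (the diagonal tiles) plus a small fixed set of leading "global" frames; the function names, the `num_global` parameter, and the choice of the first frames as the global set are all illustrative assumptions.

```python
import numpy as np


def tile_block_mask(num_frames: int, num_global: int) -> np.ndarray:
    """Frame-level mask: entry [i, j] is True if frame i may attend to frame j.

    Assumption (illustrative only): every frame keeps its own diagonal tile
    and additionally attends to the first `num_global` frames.
    """
    if not 0 <= num_global <= num_frames:
        raise ValueError("num_global must be between 0 and num_frames")
    mask = np.eye(num_frames, dtype=bool)  # diagonal tiles: frame attends to itself
    mask[:, :num_global] = True            # fixed set of global reference frames
    return mask


def expand_to_tokens(block_mask: np.ndarray, tokens_per_frame: int) -> np.ndarray:
    """Expand a frame-level mask to a token-level mask by repeating each tile."""
    rows = np.repeat(block_mask, tokens_per_frame, axis=0)
    return np.repeat(rows, tokens_per_frame, axis=1)


def kept_tiles(num_frames: int, num_global: int) -> int:
    """Number of frame-to-frame tiles that are actually computed."""
    return int(tile_block_mask(num_frames, num_global).sum())
```

Under these assumptions the number of computed tiles is `num_frames * (num_global + 1) - num_global`, which grows linearly with the number of frames for a fixed `num_global`, whereas full 3D attention computes `num_frames ** 2` tiles. For instance, `kept_tiles(8, 2)` gives 22 tiles against 64 for the dense case.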
