Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile

Hangliang Ding, Dacheng Li, Runlong Su, Peiyuan Zhang, Zhijie Deng, Ion Stoica, Hao Zhang

TL;DR

Efficient-vDiT addresses the high inference cost of 3D full-attention video diffusion transformers by exploiting Attention Tile, a repetitive, diagonal-dominant pattern in 3D attention maps, to construct fixed, low-cost sparse attention whose complexity is linear in the number of video frames. It combines three stages: multi-step consistency distillation to shorten sampling, a layer-wise search to tailor per-layer sparse masks, and knowledge distillation to recover fidelity in the sparse model. On Open-Sora-Plan-1.2 this yields 7.4x to 7.8x speedups for 29- and 93-frame 720p video generation with a marginal performance trade-off in VBench. The approach also adapts to distributed inference, achieving an additional 3.91x speedup on 4 GPUs via sequence parallelism.

Abstract

Despite the promise of synthesizing high-fidelity videos, Diffusion Transformers (DiTs) with 3D full attention suffer from expensive inference due to the complexity of attention computation and numerous sampling steps. For example, the popular Open-Sora-Plan model takes more than 9 minutes to generate a single video of 29 frames. This paper addresses the inefficiency from two aspects: 1) Prune the 3D full attention based on the redundancy within video data: we identify a prevalent tile-style repetitive pattern in the 3D attention maps for video data, and advocate a new family of sparse 3D attention that has linear complexity w.r.t. the number of video frames. 2) Shorten the sampling process by adopting existing multi-step consistency distillation: we split the entire sampling trajectory into several segments and perform consistency distillation within each one to activate few-step generation capacities. We further devise a three-stage training pipeline to conjoin the low-complexity attention and few-step generation capacities. Notably, with 0.1% of the pretraining data, we turn the Open-Sora-Plan-1.2 model into an efficient one that is 7.4x-7.8x faster for 29- and 93-frame 720p video generation with a marginal performance trade-off in VBench. In addition, we demonstrate that our approach is amenable to distributed inference, achieving an additional 3.91x speedup when running on 4 GPUs with sequence parallelism.
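The idea of splitting the sampling trajectory into segments can be illustrated with a small sketch. It assumes equal-width segments over an integer timestep range and assumes that, within a segment, the distillation target is the segment's low-noise boundary. The abstract fixes neither choice, and the names `segment_edges` and `segment_target` as well as the values 1000 and 4 are hypothetical.

```python
from bisect import bisect_left


def segment_edges(num_steps: int, num_segments: int) -> list[int]:
    """Boundaries of equal-width segments over timesteps 0..num_steps (assumed split)."""
    return [round(i * num_steps / num_segments) for i in range(num_segments + 1)]


def segment_target(t: int, edges: list[int]) -> int:
    """Lower boundary of the half-open segment (edge_i, edge_{i+1}] that contains t."""
    if not edges[0] < t <= edges[-1]:
        raise ValueError("t must lie in (edges[0], edges[-1]]")
    # bisect_left returns the index of the first edge >= t; the edge before it
    # is the low-noise end of the segment t belongs to.
    return edges[bisect_left(edges, t) - 1]


edges = segment_edges(1000, 4)   # illustrative values, not taken from the paper
print(edges)                     # [0, 250, 500, 750, 1000]
print(segment_target(600, edges))  # 500
```

With S segments, a sampler that jumps from one boundary to the next needs S network evaluations, which is the sense in which segment-wise distillation enables few-step generation.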

Paper Structure

This paper contains 26 sections, 7 equations, 10 figures, 11 tables, 1 algorithm.

Figures (10)

  • Figure 1: We observe the Attention Tile pattern in 3D DiTs. (a) The attention map can be broken down into smaller repetitive blocks. (b) These blocks can be classified into two types, where attention weights on the diagonal blocks are noticeably larger than on off-diagonal ones. (c) These blocks exhibit locality, where the attention score differences between the first frame and later frames gradually increase. (d) The block structure is stable across different data points, but varies across layers. We randomly select 2 prompts (one landscape and one portrait) and record the important positions in the attention map that account for 90% (95%, 99%) of the total. We report the proportion of stable overlap of important positions across layers.
  • Figure 2: Efficient-vDiT takes in a pre-trained 3D full-attention video diffusion transformer (DiT), which has high fidelity but slow inference. It then operates in three stages to greatly accelerate inference while maintaining fidelity. In Stage 1, we adapt the multi-step consistency distillation framework of Heek et al. (2024) to the video domain, which turns the DiT into a consistency model (CM) with stable training. In Stage 2, Efficient-vDiT runs a search algorithm to find the best sparse attention pattern for each layer. In Stage 3, Efficient-vDiT performs knowledge distillation to optimize the fidelity of the sparse DiT. At the end, Efficient-vDiT outputs a DiT with linear-complexity attention, high fidelity, and faster inference.
  • Figure 3: Exemplar attention mask ($2:6$). It maintains the attention in the main diagonals and against 2 global reference latent frames. Tile blocks in white are not computed.
  • Figure 4: Search results for Open-Sora-Plan v1.2 model (29 frames). We verify that different layers have different sparsity in 3D video DiTs.
  • Figure 5: Qualitative samples of our models. We compare the generation quality between the base model, MLCD model, and after knowledge distillation. Frames shown are equally spaced samples from the generated video. Efficient-vDiT is shortened as 'E-vdit' for simplicity. More samples can be found in the appendix.
  • ...and 5 more figures
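
The sparse mask described in the Figure 3 caption can be sketched at the tile level. This is a minimal illustration, not the authors' implementation. It assumes that "2:6" denotes 2 global reference frames among 6 latent frames, that the reference frames are the first ones, and that each frame otherwise attends only to its own diagonal tile. The function names `tile_mask` and `token_mask` are hypothetical.

```python
import numpy as np


def tile_mask(num_frames: int, num_ref: int) -> np.ndarray:
    """Frame-level mask: entry (i, j) is True if frame i attends to frame j."""
    mask = np.eye(num_frames, dtype=bool)  # diagonal tiles: each frame attends to itself
    mask[:, :num_ref] = True               # assumption: the first num_ref frames are the global references
    return mask


def token_mask(frame_mask: np.ndarray, tokens_per_frame: int) -> np.ndarray:
    """Expand every frame-level entry into a tokens_per_frame x tokens_per_frame tile."""
    rows = np.repeat(frame_mask, tokens_per_frame, axis=0)
    return np.repeat(rows, tokens_per_frame, axis=1)


m = tile_mask(6, 2)
print(int(m.sum()), "of", m.size, "tiles computed")  # 16 of 36 tiles computed

# For a fixed number of reference frames k, the computed tiles number
# (k + 1) * n - k, which grows linearly in the frame count n, while full
# attention computes n * n tiles.
for n in (6, 12, 24):
    print(n, int(tile_mask(n, 2).sum()), n * n)
```

Under these assumptions, doubling the number of frames roughly doubles the computed tiles instead of quadrupling them, which matches the linear complexity claimed for this family of masks.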