
Flash-VAED: Plug-and-Play VAE Decoders for Efficient Video Generation

Lunjie Zhu, Yushi Huang, Xingtong Ge, Yufei Xue, Zhening Liu, Yumeng Zhang, Zehong Lin, Jun Zhang

TL;DR

This work proposes a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution, and designs a three-phase dynamic distillation framework that efficiently transfers the capabilities of the original VAE decoder to Flash-VAED.

Abstract

Latent diffusion models have enabled high-quality video synthesis, yet their inference remains costly and time-consuming. As diffusion transformers become increasingly efficient, the latency bottleneck inevitably shifts to VAE decoders. To reduce their latency while maintaining quality, we propose a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution. Specifically, we propose (1) an independence-aware channel pruning method to effectively mitigate severe channel redundancy, and (2) a stage-wise dominant operator optimization strategy to address the high inference cost of the widely used causal 3D convolutions in VAE decoders. Based on these innovations, we construct a Flash-VAED family. Moreover, we design a three-phase dynamic distillation framework that efficiently transfers the capabilities of the original VAE decoder to Flash-VAED. Extensive experiments on Wan and LTX-Video VAE decoders demonstrate that our method outperforms baselines in both quality and speed, achieving an approximately 6$\times$ speedup while retaining up to 96.9% of the original reconstruction performance. Notably, Flash-VAED accelerates the end-to-end generation pipeline by up to 36% with a negligible quality drop on VBench-2.0.

Paper Structure (20 sections, 8 equations, 13 figures, 6 tables)


Figures (13)

  • Figure 1: Qualitative and quantitative comparisons of video reconstruction results. We evaluate Flash-VAED (Bottom) against the original VAE decoder (Top) and the current state-of-the-art baseline (Middle). Flash-VAED offers the fastest decoding speed with minimal loss of fidelity to the original VAE decoder.
  • Figure 2: Overview of the Flash-VAED architecture. The proposed stage-wise dominant operator optimization substitutes CausalConv3D with stage-specific efficient operators (left), tailored to each decoding stage. Moreover, the independence-aware channel pruning method (right) reduces the channel count to 12.5% to 25% of the original with minimal quality loss, leveraging channel independence.
  • Figure 3: Channel-wise similarity analysis. We visualize the top-8 channels most similar to Channel 0. Although the feature maps exhibit visual similarity, the quantitative similarity scores are not high enough to support pruning.
  • Figure 4: SVD analysis on channel features. The curve for cumulative explained variance ratio reveals the intrinsic low-rank nature of the feature maps. Notably, only 22 components (22.9% of the total) are required to explain 99% of the feature variance, providing strong empirical support for our pruning method.
  • Figure 5: Effectiveness of pre-pruning channel enhancement. We compare the reconstruction fidelity $\mathrm{R^2}$ of the retained channels before and after enhancement. The significant improvement across all layers validates that our strategy effectively forces the retained channels to encode more information.
  • ...and 8 more figures
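
The two measurements named in the captions of Figures 4 and 5 can be illustrated with a small NumPy sketch. This is not the authors' code: the function names, the synthetic rank-8 feature matrix, and the choice of a linear least-squares fit as the definition of the $R^2$ reconstruction fidelity are all assumptions made for illustration, and the paper may define these quantities differently. The sketch computes a cumulative explained variance curve from the singular values of a channel-by-sample feature matrix, counts the components needed to reach a variance threshold, and scores how well a retained subset of channels linearly reconstructs all channels.

```python
import numpy as np


def cumulative_explained_variance(features):
    """Cumulative explained variance ratio of a (channels, samples) matrix.

    Hypothetical helper: each channel is centered over its samples, and the
    squared singular values are treated as per-component variances.
    """
    centered = features - features.mean(axis=1, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return np.cumsum(variances) / variances.sum()


def components_for_threshold(cum_ratio, threshold=0.99):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    return int(np.searchsorted(cum_ratio, threshold) + 1)


def retained_channel_r2(features, keep_idx):
    """R^2 of reconstructing all channels from the retained ones.

    Assumed definition: an ordinary least-squares fit (with bias) from the
    retained channels to every channel, scored by 1 - SS_res / SS_tot.
    """
    targets = features.T                       # (samples, channels)
    inputs = features[keep_idx].T              # (samples, kept channels)
    inputs = np.hstack([inputs, np.ones((inputs.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(inputs, targets, rcond=None)
    residual = targets - inputs @ weights
    ss_res = (residual ** 2).sum()
    ss_tot = ((targets - targets.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for decoder features: 96 channels that are exact linear
# mixtures of 8 latent signals, so the matrix has rank at most 8.
rng = np.random.default_rng(0)
basis = rng.standard_normal((96, 8))
codes = rng.standard_normal((8, 4096))
demo_features = basis @ codes

demo_curve = cumulative_explained_variance(demo_features)
demo_components = components_for_threshold(demo_curve, threshold=0.99)
demo_r2_eight = retained_channel_r2(demo_features, np.arange(8))
demo_r2_two = retained_channel_r2(demo_features, np.arange(2))
print(demo_components, demo_r2_eight, demo_r2_two)
```

On real decoder activations, the feature matrix would be built by flattening each channel's spatio-temporal feature map into one row. A low component count relative to the channel count is the kind of evidence Figure 4 describes, and a higher $R^2$ for the same retained set is the kind of improvement Figure 5 reports.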