Flash-VAED: Plug-and-Play VAE Decoders for Efficient Video Generation

Lunjie Zhu; Yushi Huang; Xingtong Ge; Yufei Xue; Zhening Liu; Yumeng Zhang; Zehong Lin; Jun Zhang

TL;DR

This work proposes a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution, and designs a three-phase dynamic distillation framework that efficiently transfers the capabilities of the original VAE decoder to Flash-VAED.

Abstract

Latent diffusion models have enabled high-quality video synthesis, yet their inference remains costly and time-consuming. As diffusion transformers become increasingly efficient, the latency bottleneck inevitably shifts to VAE decoders. To reduce their latency while maintaining quality, we propose a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution. Specifically, we propose (1) an independence-aware channel pruning method to effectively mitigate severe channel redundancy, and (2) a stage-wise dominant operator optimization strategy to address the high inference cost of the widely used causal 3D convolutions in VAE decoders. Based on these innovations, we construct a Flash-VAED family. Moreover, we design a three-phase dynamic distillation framework that efficiently transfers the capabilities of the original VAE decoder to Flash-VAED. Extensive experiments on Wan and LTX-Video VAE decoders demonstrate that our method outperforms baselines in both quality and speed, achieving an approximately 6$\times$ speedup while retaining up to 96.9% of the original reconstruction performance. Notably, Flash-VAED accelerates the end-to-end generation pipeline by up to 36% with negligible quality drops on VBench-2.0.
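As a rough illustration of why channel pruning matters for decoder cost, the sketch below counts multiply-accumulates for a single dense 3D convolution before and after pruning. It is not the paper's cost model: the 3x3x3 kernel, the 256-channel width, and the function names are assumptions chosen for illustration, and the keep fractions are taken from the 12.5% to 25% range stated in the Figure 2 caption. It also assumes input and output channels are pruned by the same fraction.

```python
def conv3d_macs(c_in, c_out, kernel=(3, 3, 3), out_voxels=1):
    """Multiply-accumulates of a dense 3D convolution (bias ignored)."""
    kt, kh, kw = kernel
    return c_in * c_out * kt * kh * kw * out_voxels


def pruned_cost_ratio(keep_fraction, c_in=256, c_out=256):
    """Cost of the pruned layer relative to the unpruned one.

    Assumption: input and output channels are both reduced to
    `keep_fraction` of their original count (256 is a hypothetical width).
    """
    kept_in = int(c_in * keep_fraction)
    kept_out = int(c_out * keep_fraction)
    return conv3d_macs(kept_in, kept_out) / conv3d_macs(c_in, c_out)


if __name__ == "__main__":
    for keep in (0.25, 0.125):
        print(f"keep {keep:.3f} of channels -> relative cost {pruned_cost_ratio(keep):.6f}")
```

Because the cost of a dense convolution is proportional to the product of input and output channels, keeping a fraction r of both scales the per-layer cost by r squared. This is a per-layer estimate only; measured end-to-end speedups, such as the approximately 6$\times$ figure reported in the abstract, also depend on memory traffic, non-convolutional operators, and layers that are not pruned.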

Paper Structure (20 sections, 8 equations, 13 figures, 6 tables)

Introduction
Related Work
Method
Independence-Aware Channel Pruning
Stage-Wise Dominant Operator Optimization
Training Strategy: Three-Phase Dynamic Distillation Framework
Experiments
Experimental Setup
Main Results
Ablation Study
Conclusions
Implementation Details
Mechanism of Depthwise Separable Convolutions
Detailed Analysis of the Effectiveness of Shortcut Injection
Gradient Mask for Training Stage 2
...and 5 more sections

Figures (13)

Figure 1: Qualitative and quantitative comparisons of video reconstruction results. We evaluate Flash-VAED (Bottom) against the original VAE decoder (Top) and the current state-of-the-art baseline (Middle). Flash-VAED offers the fastest decoding speed with minimal loss of fidelity to the original VAE decoder.
Figure 2: Overview of the Flash-VAED architecture. The proposed stage-wise dominant operator optimization substitutes CausalConv3D with stage-specific efficient operators (left), tailored to each decoding stage. Moreover, the independence-aware channel pruning method (right) reduces the channel count to 12.5% to 25% of the original with minimal quality loss, leveraging channel independence.
Figure 3: Channel-wise similarity analysis. We visualize the top-8 channels most similar to Channel 0. Although the feature maps exhibit visual similarity, the quantitative similarity scores are not high enough to support pruning.
Figure 4: SVD analysis on channel features. The curve for cumulative explained variance ratio reveals the intrinsic low-rank nature of the feature maps. Notably, only 22 components (22.9% of the total) are required to explain 99% of the feature variance, providing strong empirical support for our pruning method.
Figure 5: Effectiveness of pre-pruning channel enhancement. We compare the reconstruction fidelity $\mathrm{R^2}$ of the retained channels before and after enhancement. The significant improvement across all layers validates that our strategy effectively forces the retained channels to encode more information.
...and 8 more figures
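The cumulative explained variance ratio described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names are hypothetical, the singular values are synthetic, and the total of 96 components is only inferred from the caption's "22 components (22.9% of the total)". The sketch assumes the singular values come from an SVD of mean-centered channel features and are sorted in descending order.

```python
def cumulative_explained_variance(singular_values):
    """Cumulative share of variance explained by the first k components.

    The variance captured by a component is proportional to its
    squared singular value.
    """
    energy = [s * s for s in singular_values]
    total = sum(energy)
    ratios, running = [], 0.0
    for e in energy:
        running += e
        ratios.append(running / total)
    return ratios


def components_needed(singular_values, threshold=0.99):
    """Smallest k whose cumulative explained variance reaches `threshold`."""
    ratios = cumulative_explained_variance(singular_values)
    for k, ratio in enumerate(ratios, start=1):
        if ratio >= threshold:
            return k
    return len(singular_values)


if __name__ == "__main__":
    # Synthetic, geometrically decaying spectrum with 96 components
    # (96 is inferred from the caption, the decay rate is made up).
    synthetic = [0.8 ** i for i in range(96)]
    k = components_needed(synthetic, threshold=0.99)
    print(f"{k} of {len(synthetic)} components reach 99% explained variance")
```

A spectrum that decays quickly reaches the threshold with few components, which is the low-rank behavior the caption points to as motivation for pruning.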
