LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models

Yu Cheng, Fajie Yuan

TL;DR

LeanVAE targets the computational bottleneck that Video VAEs create in Latent Video Diffusion Models (LVDMs). It combines a lightweight, patch-based backbone built on Neighborhood-Aware Feedforward (NAF) modules with Haar wavelet transforms that enrich the input, and adds a compressed sensing (CS) latent channel bottleneck based on ISTA-Net+. The model uses up to 50× fewer FLOPs and runs inference up to 44× faster than existing Video VAEs while keeping reconstruction quality competitive, and it improves generation performance when paired with diffusion-based video models. Ablations support three design choices: separate LC/HC processing, CS rather than a traditional autoencoder for channel compression, and omitting patch normalization to avoid block artifacts. The results are practical and scalable for high-resolution video reconstruction and generation, with potential applicability to broader LVDM workflows; higher compression ratios and discrete latent spaces are left for future exploration.
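To make the wavelet step concrete, here is a minimal NumPy sketch of a single-level 2D Haar transform on one frame. It is an illustration only: the function names `haar2d` and `ihaar2d` are hypothetical, the subband naming is one common convention, and LeanVAE's actual patchifier (which handles images and videos jointly) may differ in both layout and scaling.

```python
import numpy as np


def haar2d(x):
    """Single-level 2D Haar transform of an array with even height and width.

    Returns four half-resolution subbands: one low-frequency average (ll)
    and three high-frequency detail bands (lh, hl, hh).
    """
    a = x[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # local average (low frequency)
    lh = (a - b + c - d) / 2.0  # difference along the width axis
    hl = (a + b - c - d) / 2.0  # difference along the height axis
    hh = (a - b - c + d) / 2.0  # diagonal difference
    return ll, lh, hl, hh


def ihaar2d(ll, lh, hl, hh):
    """Inverse of haar2d: rebuilds the full-resolution array exactly."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=np.result_type(ll, float))
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x


if __name__ == "__main__":
    frame = np.random.default_rng(0).random((8, 8))
    bands = haar2d(frame)
    print([band.shape for band in bands])
    print(np.allclose(ihaar2d(*bands), frame))
```

Because the transform is orthonormal and invertible, it rearranges the frame into frequency subbands without losing information, which is what lets a model treat low- and high-frequency content differently before any learned compression happens.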

Abstract

Recent advances in Latent Video Diffusion Models (LVDMs) have revolutionized video generation by leveraging Video Variational Autoencoders (Video VAEs) to compress intricate video data into a compact latent space. However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly for encoding high-resolution videos. To address this, we propose LeanVAE, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, drastically reducing computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. Extensive experiments validate LeanVAE's superiority in video reconstruction and generation, particularly in enhancing efficiency over existing Video VAEs. Our model offers up to 50x fewer FLOPs and 44x faster inference speed while maintaining competitive reconstruction quality, providing insights for scalable, efficient video generation. Our models and code are available at https://github.com/westlake-repl/LeanVAE
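For readers unfamiliar with compressed sensing, the sketch below shows the classical ISTA iteration that ISTA-Net+ unrolls into a learned network. It is not the paper's bottleneck: the measurement matrix `phi`, the regularization weight `lam`, and the function names are assumptions chosen for illustration, and the learned variant replaces the fixed soft-thresholding step with trained transforms.

```python
import numpy as np


def soft_threshold(v, theta):
    """Element-wise shrinkage: the proximal operator of theta * ||v||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)


def ista(y, phi, lam=0.01, n_iter=500):
    """Classical ISTA for min_x 0.5 * ||phi @ x - y||^2 + lam * ||x||_1.

    y   : measurement vector of shape (m,)
    phi : measurement matrix of shape (m, n), typically with m < n
    """
    # Step size 1/L, where L is the squared spectral norm of phi.
    step = 1.0 / np.linalg.norm(phi, 2) ** 2
    x = np.zeros(phi.shape[1])
    for _ in range(n_iter):
        grad = phi.T @ (phi @ x - y)  # gradient of the data-fidelity term
        x = soft_threshold(x - step * grad, lam * step)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    phi = rng.standard_normal((32, 64)) / np.sqrt(32)
    x_true = np.zeros(64)
    x_true[[3, 17, 40]] = [1.5, -2.0, 1.0]  # sparse signal
    y = phi @ x_true                          # compressed measurements
    x_hat = ista(y, phi)
    print(np.linalg.norm(phi @ x_hat - y))
```

The relevant idea for a latent channel bottleneck is that a signal compressed by a linear measurement can be restored by iterating a gradient step and a shrinkage step, rather than by a single learned decoder layer.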

Paper Structure

This paper contains 20 sections, 7 equations, 10 figures, and 4 tables.

Figures (10)

  • Figure 1: (a) LeanVAE framework overview. (b) Key components: the Patchifier performs joint image-video patching in the frequency domain; the Encoder performs hierarchical feature extraction; (Res)NAF serves as the model backbone, enabling Neighborhood-Aware Feedforward (with Residual connections); the Latent Channel Bottleneck compresses and restores latent channels based on the $\textit{ISTA-Net}^{+}$ algorithm.
  • Figure 2: Qualitative comparison between LeanVAE and leading baselines. Due to space limitations, we present only the models with a latent channel size of 4. The reconstruction performance of the leading models with 16 latent channels is notably better, and their visual differences are subtle. More comprehensive visual comparisons are available in the supplementary video.
  • Figure 2: Ablation study on different components. Variant 2 (highlighted in gray) serves as the baseline across all groups, with CS channel compression and without patch normalization.
  • Figure 3: Comparison across multiple resolutions. (a) Computational cost in terms of TFLOPs (bar plots, labeled in black) and encoding-decoding time (line plots, labeled in red). (b) Reconstruction quality metrics. All evaluations were conducted on 17-frame videos using a single NVIDIA A40 (48GB) GPU.
  • Figure 4: Examples of block artifacts in reconstructed video. Left: ground truth; Middle: reconstruction of model w/o pixel normalization; Right: reconstruction of model w/ pixel normalization.
  • ...and 5 more figures