WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model

Zongjian Li, Bin Lin, Yang Ye, Liuhan Chen, Xinhua Cheng, Shenghai Yuan, Li Yuan

TL;DR

The paper tackles the high computational cost and latent-space discontinuities of video VAEs used in Latent Video Diffusion Models. It introduces WF-VAE, which leverages multi-level Haar wavelet transforms to create a low-frequency energy-flow pathway into the latent space, reducing backbone complexity. A Causal Cache mechanism ensures lossless block-wise inference, preserving temporal continuity across long videos. Empirical results show that WF-VAE outperforms state-of-the-art video VAEs on PSNR and LPIPS while achieving 2x higher throughput and 4x lower memory consumption, which makes large-scale pre-training of video diffusion models more practical.

Abstract

A Video Variational Autoencoder (VAE) encodes videos into a low-dimensional latent space and has become a key component of most Latent Video Diffusion Models (LVDMs), where it reduces model training costs. However, as the resolution and duration of generated videos increase, the encoding cost of Video VAEs becomes a limiting bottleneck in training LVDMs. Moreover, the block-wise inference method adopted by most LVDMs can lead to discontinuities in the latent space when processing long-duration videos. The key to addressing the computational bottleneck lies in decomposing videos into distinct components and efficiently encoding the critical information. Because the wavelet transform can decompose videos into multiple frequency-domain components and significantly improve efficiency, we propose Wavelet Flow VAE (WF-VAE), an autoencoder that leverages a multi-level wavelet transform to facilitate low-frequency energy flow into the latent representation. Furthermore, we introduce a method called Causal Cache, which maintains the integrity of the latent space during block-wise inference. Compared to state-of-the-art video VAEs, WF-VAE demonstrates superior performance in both PSNR and LPIPS metrics, achieving 2x higher throughput and 4x lower memory consumption while maintaining competitive reconstruction quality. Our code and models are available at https://github.com/PKU-YuanGroup/WF-VAE.
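
To make the idea of decomposing a video into frequency components more concrete, the following is a minimal NumPy sketch of a one-level 3D Haar transform. It is not taken from the WF-VAE code base: the function names `haar_split` and `haar3d`, the synthetic test video, and the assumption that all dimensions are even are illustrative choices only. It shows that for a smooth video most of the signal energy ends up in the low-frequency subband, which is the component the abstract says is channeled into the latent representation.

```python
import numpy as np

def haar_split(x, axis):
    """One-level orthonormal Haar split along `axis` (length must be even)."""
    even = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
    odd = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
    low = (even + odd) / np.sqrt(2.0)   # local averages (low frequency)
    high = (even - odd) / np.sqrt(2.0)  # local differences (high frequency)
    return low, high

def haar3d(video):
    """Split a (T, H, W) array into 8 subbands keyed 'LLL', 'LLH', ..., 'HHH'.

    The key letters follow the axis order (time, height, width).
    """
    bands = {"": video}
    for axis in range(3):
        split = {}
        for name, arr in bands.items():
            low, high = haar_split(arr, axis)
            split[name + "L"] = low
            split[name + "H"] = high
        bands = split
    return bands

# Hypothetical smooth test video (illustrative only, not data from the paper).
T, H, W = 8, 32, 32
t, y, x = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
video = 1.0 + 0.5 * np.sin(2 * np.pi * y / H) + 0.5 * np.cos(2 * np.pi * x / W) + 0.1 * t / T

bands = haar3d(video)
total_energy = float(np.sum(video ** 2))
low_fraction = float(np.sum(bands["LLL"] ** 2)) / total_energy
print(f"share of energy in the LLL subband: {low_fraction:.4f}")
```

Because each split is orthonormal, the subband energies sum to the energy of the input, so `low_fraction` directly measures how much of the signal the low-frequency band carries. A multi-level transform, as used in the paper, would apply the same split again to the `LLL` band.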

Paper Structure

This paper contains 13 sections, 7 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Performance comparison of video VAEs. Bubble area indicates the memory usage during inference. All measurements are conducted on 33-frame videos at 256×256 resolution. "Chn" represents the number of latent channels. Higher PSNR and throughput indicate better performance.
  • Figure 2: Overview of WF-VAE. Our architecture consists of a backbone and a main energy flow pathway. The pathway functions as a “highway” for the main flow of video energy, channeling this energy into the backbone through concatenations, allowing more critical video information to be preserved in the latent representation.
  • Figure 2: Quantitative evaluation of different VAE models for video generation. We assess video generation quality using FVD$_{16}$ on both SkyTimelapse and UCF-101 datasets, and IS on UCF-101, following Latte (Ma et al., 2024).
  • Figure 3: (a) Causal Cache with a temporal kernel size of 3 and stride 1. (b) Comparison of tiling inference and Causal Cache, highlighting how tiling causes local color and shape distortions at overlaps, leading to global flickering in reconstructed videos.
  • Figure 4: Computational performance of encoding and decoding. We evaluate the encoding, decoding time, and memory consumption across 33 frames with 256×256, 512×512, and 768×768 resolutions (benchmark models without causal convolution are tested with 32 frames). WF-VAE surpasses other VAE models by a large margin in terms of both inference speed and memory efficiency.
  • ...and 3 more figures
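
The setting in Figure 3(a), a causal temporal convolution with kernel size 3 and stride 1, can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `causal_conv_chunk`, the first-frame replication padding, the fixed averaging kernel, and the chunk size of 5 are all hypothetical. The idea it demonstrates is that carrying the last `k - 1` input frames of each chunk over to the next one makes block-wise processing match processing the whole video at once.

```python
import numpy as np

def causal_conv_chunk(x, kernel, cache=None):
    """Causal temporal convolution (stride 1) over one chunk of frames.

    x:      array of shape (T, ...), one chunk of frames
    kernel: array of shape (k,), temporal weights (assumed fixed here)
    cache:  the last k - 1 input frames of the previous chunk, or None
    Returns the chunk output and the cache to pass to the next chunk.
    """
    k = len(kernel)
    if cache is None:
        # First chunk: pad by repeating the first frame (an assumed convention).
        cache = np.repeat(x[:1], k - 1, axis=0)
    padded = np.concatenate([cache, x], axis=0)
    n = x.shape[0]
    # Output frame t depends only on input frames t - k + 1 .. t.
    y = sum(kernel[i] * padded[i:i + n] for i in range(k))
    new_cache = padded[padded.shape[0] - (k - 1):]
    return y, new_cache

rng = np.random.default_rng(0)
video = rng.standard_normal((17, 4, 4))   # hypothetical 17-frame clip
kernel = np.array([0.25, 0.5, 0.25])      # kernel size 3, stride 1

# Reference: process the whole clip in one pass.
full, _ = causal_conv_chunk(video, kernel)

# Block-wise: process 5 frames at a time, passing the cache forward.
outputs, cache = [], None
for start in range(0, video.shape[0], 5):
    out, cache = causal_conv_chunk(video[start:start + 5], kernel, cache)
    outputs.append(out)
chunked = np.concatenate(outputs, axis=0)

print(np.allclose(full, chunked))
```

Resetting the cache to `None` at every chunk would re-pad each chunk from its own first frame and change the outputs near chunk boundaries, which is the kind of discontinuity the cache is meant to avoid. Strided convolutions need a different cache length and are not covered by this sketch.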