Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability
Shizhan Liu, Xinran Deng, Zhuoyi Yang, Jiayan Teng, Xiaotao Gu, Jie Tang
TL;DR
This work analyzes how the latent-space spectra of video VAEs shape diffusion training and identifies two properties that improve diffusability: a spatio-temporal frequency spectrum biased toward low frequencies and a channel-wise eigenspectrum dominated by a few modes. It introduces two lightweight regularizers, Local Correlation Regularization (LCR) and Latent Masked Reconstruction (LMR), to induce these properties, yielding the Spectral-Structured VAE (SSVAE). Experiments show roughly 3× faster convergence and about 10% higher video reward across multiple backbones and resolutions in diffusion-based text-to-video generation. The approach is backbone-agnostic and modular: it improves generative video modeling by shaping the latent space rather than only improving reconstruction fidelity.
Abstract
Latent diffusion models pair VAEs with diffusion backbones, and the structure of VAE latents strongly influences the difficulty of diffusion training. However, existing video VAEs typically focus on reconstruction fidelity, overlooking latent structure. We present a statistical analysis of video VAE latent spaces and identify two spectral properties essential for diffusion training: a spatio-temporal frequency spectrum biased toward low frequencies, and a channel-wise eigenspectrum dominated by a few modes. To induce these properties, we propose two lightweight, backbone-agnostic regularizers: Local Correlation Regularization and Latent Masked Reconstruction. Experiments show that our Spectral-Structured VAE (SSVAE) achieves a 3× speedup in text-to-video generation convergence and a 10% gain in video reward, outperforming strong open-source VAEs. The code is available at https://github.com/zai-org/SSVAE.
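
As a rough illustration of the two spectral properties named in the abstract, the NumPy sketch below computes one simple diagnostic for each on a single latent tensor. It is not the authors' analysis code and does not implement either regularizer. The function names, the assumed (C, T, H, W) latent layout, the radial cutoff of 0.25 cycles per sample, and the top-k energy summary are all assumptions chosen for illustration.

```python
import numpy as np


def low_frequency_energy_fraction(latents, cutoff=0.25):
    """Fraction of spatio-temporal spectral energy at or below `cutoff`.

    `latents` is assumed to have shape (C, T, H, W). Frequencies are
    measured in cycles per sample, so each axis spans [-0.5, 0.5).
    """
    # Remove the per-channel mean so the DC term does not dominate.
    centered = latents - latents.mean(axis=(1, 2, 3), keepdims=True)
    # Power spectrum over the time, height, and width axes.
    power = np.abs(np.fft.fftn(centered, axes=(1, 2, 3))) ** 2

    ft = np.fft.fftfreq(latents.shape[1])
    fh = np.fft.fftfreq(latents.shape[2])
    fw = np.fft.fftfreq(latents.shape[3])
    radius = np.sqrt(
        ft[:, None, None] ** 2 + fh[None, :, None] ** 2 + fw[None, None, :] ** 2
    )

    total = power.sum()
    if total == 0.0:
        return 0.0  # constant latents carry no non-DC energy
    return float(power[:, radius <= cutoff].sum() / total)


def top_k_channel_energy(latents, k=1):
    """Share of channel-covariance eigenvalue mass held by the top k modes."""
    channels = latents.shape[0]
    flat = latents.reshape(channels, -1)
    flat = flat - flat.mean(axis=1, keepdims=True)
    cov = flat @ flat.T / flat.shape[1]  # (C, C) channel covariance
    # eigvalsh returns ascending eigenvalues; reverse to descending order.
    eigvals = np.clip(np.linalg.eigvalsh(cov)[::-1], 0.0, None)
    total = eigvals.sum()
    if total == 0.0:
        return 0.0
    return float(eigvals[:k].sum() / total)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.standard_normal((4, 8, 16, 16))  # stand-in for real latents
    print("low-frequency fraction:", low_frequency_energy_fraction(noise))
    print("top-1 channel energy:", top_k_channel_energy(noise, k=1))
```

Under these assumptions, a latent space with the properties the abstract describes would score higher on both diagnostics than white noise of the same shape. Real measurements would average over many encoded videos rather than a single tensor.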
