Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability

Shizhan Liu, Xinran Deng, Zhuoyi Yang, Jiayan Teng, Xiaotao Gu, Jie Tang

TL;DR

This work analyzes how latent-space spectra in video VAEs shape diffusion training and identifies two properties that improve diffusability: a spatio-temporal frequency spectrum biased toward low frequencies and a channel eigenspectrum dominated by a few modes. It introduces two lightweight regularizers, Local Correlation Regularization (LCR) and Latent Masked Reconstruction (LMR), to induce these properties, yielding the Spectral-Structured VAE (SSVAE). Empirical results show roughly 3× faster convergence and about 10% higher video reward across multiple backbones and resolutions in diffusion-based text-to-video generation. The approach is backbone-agnostic and modular: it improves generative video modeling by shaping the latent space rather than only improving reconstruction fidelity.

Abstract

Latent diffusion models pair VAEs with diffusion backbones, and the structure of VAE latents strongly influences the difficulty of diffusion training. However, existing video VAEs typically focus on reconstruction fidelity, overlooking latent structure. We present a statistical analysis of video VAE latent spaces and identify two spectral properties essential for diffusion training: a spatio-temporal frequency spectrum biased toward low frequencies, and a channel-wise eigenspectrum dominated by a few modes. To induce these properties, we propose two lightweight, backbone-agnostic regularizers: Local Correlation Regularization and Latent Masked Reconstruction. Experiments show that our Spectral-Structured VAE (SSVAE) achieves a $3\times$ speedup in text-to-video generation convergence and a 10% gain in video reward, outperforming strong open-source VAEs. The code is available at https://github.com/zai-org/SSVAE.
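The two spectral properties named in the abstract can be measured on any latent tensor. The following is a minimal NumPy sketch, not the authors' analysis code: it assumes a single latent of shape `(C, T, H, W)`, and the function names `radial_psd` and `channel_eigenspectrum`, the binning scheme, and the normalization are illustrative choices rather than anything taken from the paper or its repository.

```python
import numpy as np

def radial_psd(latent, n_bins=8):
    """Radially averaged spatio-temporal power spectrum of a (C, T, H, W) latent.

    Returns the mean power in n_bins frequency-radius bins, ordered low to high.
    The binning is an illustrative assumption, not the paper's procedure.
    """
    _, t, h, w = latent.shape
    # 3-D FFT over the time, height and width axes of every channel
    spec = np.fft.fftn(latent, axes=(1, 2, 3))
    power = (np.abs(spec) ** 2).mean(axis=0)  # average over channels -> (T, H, W)
    # Normalized frequency coordinates (cycles per sample) on each axis
    ft = np.fft.fftfreq(t)[:, None, None]
    fh = np.fft.fftfreq(h)[None, :, None]
    fw = np.fft.fftfreq(w)[None, None, :]
    radius = np.sqrt(ft ** 2 + fh ** 2 + fw ** 2)
    edges = np.linspace(0.0, radius.max() + 1e-12, n_bins + 1)
    idx = np.clip(np.digitize(radius.ravel(), edges) - 1, 0, n_bins - 1)
    sums = np.bincount(idx, weights=power.ravel(), minlength=n_bins)
    counts = np.bincount(idx, minlength=n_bins)
    return sums / np.maximum(counts, 1)

def channel_eigenspectrum(latent):
    """Eigenvalues of the channel covariance matrix, sorted descending, summing to 1."""
    c = latent.shape[0]
    x = latent.reshape(c, -1)
    x = x - x.mean(axis=1, keepdims=True)   # center each channel
    cov = x @ x.T / x.shape[1]              # (C, C) channel covariance
    eig = np.linalg.eigvalsh(cov)[::-1]     # eigvalsh returns ascending order
    return eig / eig.sum()
```

Under these assumptions, a low-frequency-biased latent concentrates power in the first entries of `radial_psd`, and a few-mode-biased latent puts most of the mass of `channel_eigenspectrum` on its leading entries.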

Paper Structure

This paper contains 23 sections, 2 theorems, 14 equations, 14 figures, 5 tables.

Key Result

Theorem 1

$\Sigma_{{\mathbf{v}}{\mathbf{u}}}(t)$ has the same eigenvectors as $\Sigma_{{\mathbf{u}}{\mathbf{u}}}$, and its eigenvalue on the $l$-th eigenvector of $\Sigma_{{\mathbf{u}}{\mathbf{u}}}$ is given by an expression that is not reproduced in this summary. Please refer to Sec. 2 of the supplementary material for the proof of Theorem 1.

Figures (14)

  • Figure 1: We identify that both a low-frequency biased spatio-temporal frequency spectrum and a few-mode biased channel eigenspectrum facilitate diffusion training. By inducing low-frequency bias, few-mode bias and enhancing decoder robustness, our SSVAE achieves a $3\times$ convergence speedup over the baseline on $17\times 512\times 512$ generation, using prompts from VBench.
  • Figure 2: (a) SER and VA-VAE do not sufficiently suppress high-frequency components, whereas LCR exhibits the strongest low-frequency bias. (b) In general, steeper PSDs correspond to larger local correlation, and result in better video generation quality.
  • Figure 3: LCR introduces a low-frequency bias by promoting pairwise correlations within each spatio-temporal local patch in the normalized latents. We omit the channel dimension for simplicity.
  • Figure 4: Comparative analysis of the VAE latent channel covariance matrix and the diffusion output–input cross-correlation matrix. A few-mode-biased latent space is associated with a lower diffusion loss scale, higher generation quality, and faster convergence.
  • Figure 5: LMR introduces a few-mode bias by reconstructing videos using spatio-temporally masked latents. We omit the channel dimension for simplicity.
  • ...and 9 more figures
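The captions of Figures 3 and 5 describe the two regularizers only at a high level, so the sketch below is a loose illustration, not the SSVAE implementation. It assumes latents of shape `(C, T, H, W)`, non-overlapping patches, per-channel standardization as the "normalized latents", and a binary mask shared across channels. The names `local_correlation` and `mask_latent` and the parameters `patch` and `mask_ratio` are hypothetical.

```python
import numpy as np

def local_correlation(latent, patch=(2, 2, 2), eps=1e-6):
    """Mean pairwise correlation between positions inside each local patch.

    latent: (C, T, H, W) with T, H, W divisible by the patch sizes.
    Non-overlapping patches and per-channel standardization are assumptions.
    """
    c, t, h, w = latent.shape
    pt, ph, pw = patch
    # Standardize every channel to zero mean and unit variance
    z = latent - latent.mean(axis=(1, 2, 3), keepdims=True)
    z = z / (z.std(axis=(1, 2, 3), keepdims=True) + eps)
    # Split into patches: (C, nT, pt, nH, ph, nW, pw)
    z = z.reshape(c, t // pt, pt, h // ph, ph, w // pw, pw)
    # Rearrange to (n_patches, positions_per_patch, C)
    z = z.transpose(1, 3, 5, 2, 4, 6, 0).reshape(-1, pt * ph * pw, c)
    # Channel-averaged product for every pair of positions in a patch
    gram = z @ z.transpose(0, 2, 1) / c     # (n_patches, P, P)
    p = gram.shape[-1]
    off_diag = gram.sum(axis=(1, 2)) - np.trace(gram, axis1=1, axis2=2)
    return float((off_diag / (p * (p - 1))).mean())

def mask_latent(latent, mask_ratio=0.5, rng=None):
    """Zero a random subset of spatio-temporal positions, shared across channels.

    Returns the masked latent and the boolean keep-mask of shape (1, T, H, W).
    """
    rng = np.random.default_rng() if rng is None else rng
    _, t, h, w = latent.shape
    keep = rng.random((1, t, h, w)) >= mask_ratio
    return latent * keep, keep
```

In a training loop, one hypothetical use would be to add a penalty that decreases as `local_correlation` grows, and to feed the output of `mask_latent` to the decoder with the usual reconstruction loss; the paper's actual loss definitions and masking scheme may differ.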

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 1
  • proof