Distribution Matching Variational AutoEncoder

Sen Ye, Jianning Pei, Mengde Xu, Shuyang Gu, Chunyu Wang, Liwei Wang, Han Hu

TL;DR

DMVAE addresses how to choose latent priors in two-stage visual generation by explicitly matching the encoder's aggregate posterior to a flexible reference distribution using diffusion-score distillation. This enables exploring distribution forms beyond Gaussian priors, including SSL-derived priors, diffusion-noise, and text-embedding distributions. The paper demonstrates that SSL-derived priors yield a favorable balance between reconstruction fidelity and modeling efficiency, achieving gFID of 3.22 on ImageNet at 256x256 with 64 training epochs and 1.82 with 400 epochs, and shows faster convergence relative to prior tokenizers. The work highlights that selecting an appropriate latent-distribution structure, via distribution-level alignment, is key to bridging easy-to-model latents and high-fidelity synthesis.

Abstract

Most visual generative models compress images into a latent space before applying diffusion or autoregressive modeling. Yet existing approaches, such as VAEs and foundation-model-aligned encoders, constrain the latent space only implicitly, without explicitly shaping its distribution, so it remains unclear which types of distributions are optimal for modeling. We introduce Distribution-Matching VAE (DMVAE), which explicitly aligns the encoder's latent distribution with an arbitrary reference distribution via a distribution matching constraint. This generalizes beyond the Gaussian prior of conventional VAEs, enabling alignment with distributions derived from self-supervised features, diffusion noise, or other prior distributions. With DMVAE, we can systematically investigate which latent distributions are more conducive to modeling, and we find that SSL-derived distributions provide an excellent balance between reconstruction fidelity and modeling efficiency, reaching a gFID of 3.2 on ImageNet with only 64 training epochs. Our results suggest that choosing a suitable latent distribution structure (achieved via distribution-level alignment), rather than relying on fixed priors, is key to bridging the gap between easy-to-model latents and high-fidelity image synthesis. Code is available at https://github.com/sen-ye/dmvae.
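To make the idea of distribution-level alignment concrete, the following is a minimal 2D sketch, not the authors' implementation. It assumes both the reference ("real") distribution and the current sample ("fake") distribution are diagonal Gaussians, so their scores are available in closed form; DMVAE instead estimates scores with diffusion models and backpropagates through an encoder. The names `gaussian_score` and `match_distribution` are hypothetical. Each step moves samples along the real score minus the fake score, which is the descent direction for the KL divergence between the sample distribution and the reference.

```python
import numpy as np

def gaussian_score(z, mean, var):
    """Score (gradient of the log-density) of a diagonal Gaussian, evaluated at z."""
    return -(z - mean) / var

def match_distribution(z, ref_mean, ref_var, lr=0.05, steps=2000):
    """Move samples so that their distribution approaches the reference.

    Assumption: the "fake" score is that of a diagonal Gaussian refit to the
    current samples at every step, standing in for a learned score model.
    """
    z = z.copy()
    for _ in range(steps):
        fake_mean = z.mean(axis=0)
        fake_var = z.var(axis=0)
        real = gaussian_score(z, ref_mean, ref_var)    # pulls samples toward the reference
        fake = gaussian_score(z, fake_mean, fake_var)  # score of the current sample distribution
        z += lr * (real - fake)                        # descend the KL gradient (fake - real)
    return z

rng = np.random.default_rng(0)
ref_mean = np.array([2.0, -1.0])   # illustrative reference parameters
ref_var = np.array([0.25, 4.0])
z0 = rng.normal(loc=0.0, scale=1.0, size=(4096, 2))  # stand-in for encoder outputs
z = match_distribution(z0, ref_mean, ref_var)

print("mean:", z.mean(axis=0))
print("var: ", z.var(axis=0))
```

Only the aggregate statistics of the samples are constrained here; individual samples are never paired with individual reference points, which is the distinction between distribution-level and pointwise matching.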

Paper Structure

This paper contains 37 sections, 10 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Illustration of VAE [kingma2013auto], RAE [zheng2025diffusion], pointwise matching encoder [yao2025reconstruction, chen2025aligning], and Distribution Matching VAE.
  • Figure 2: The training pipeline of Distribution Matching VAE.
  • Figure 3: Analysis of different distribution matching objectives in a 2D setting. (a) shows the reference distribution; (b) shows the distribution matching objective; (d, e, f) show real score maximization with different stop-gradient placements; (g, h, i) show methods for fake score maximization; (j, k, l) show direct optimization of the difference between the real and fake scores; and (c) shows optimization of the difference between the real and fake diffusion losses. For each objective, we list the loss function and its gradient $\nabla_{z} \mathcal{L}$.
  • Figure 4: Illustration of t-SNE on different distributions. (a-d) represent four different reference distributions, (e-h) represent the distribution of the DMVAE encoder learned from these four reference distributions, (i-k) represent the encoder distribution learned from the data-independent distribution, and (l) represents the distribution of the $\beta$-VAE encoder output.
  • Figure 5: Different CFG scales correspond to different distributions.
  • ...and 3 more figures