Unified Latents (UL): How to train your latents
Jonathan Heek, Emiel Hoogeboom, Thomas Mensink, Tim Salimans
TL;DR
Unified Latents (UL) jointly trains an encoder, a diffusion prior, and a diffusion decoder, so that the latents are regularized by the prior and decoded by a diffusion model. Tying the encoder's output noise level to the prior's minimum noise level yields a simple upper bound on the information content of the latents, together with a two-stage training procedure that uses a sigmoid-weighted decoder ELBO. Empirically, UL reaches a competitive FID of 1.4 on ImageNet-512 with high PSNR and reduced training FLOPs, and sets a new state-of-the-art FVD of 1.3 on Kinetics-600. Larger base models benefit from more informative latents, and the latent bitrate can be tuned with a small set of hyperparameters. Overall, UL offers a principled, scalable route to efficient latent diffusion by explicitly controlling latent information content through a diffusion-based prior and decoder.
Abstract
We present Unified Latents (UL), a framework for learning latent representations that are jointly regularized by a diffusion prior and decoded by a diffusion model. By linking the encoder's output noise to the prior's minimum noise level, we obtain a simple training objective that provides a tight upper bound on the latent bitrate. On ImageNet-512, our approach achieves a competitive FID of 1.4 with high reconstruction quality (PSNR), while requiring fewer training FLOPs than models trained on Stable Diffusion latents. On Kinetics-600, we set a new state-of-the-art FVD of 1.3.
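The idea of linking the encoder's output noise to the prior's minimum noise level can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a variance-preserving Gaussian noising process parameterized by log-SNR, which the abstract does not specify, and the names `alpha_sigma`, `noisy_latent`, and the value of `LOGSNR_AT_MIN_NOISE` are hypothetical placeholders chosen for illustration.

```python
import numpy as np

# ASSUMPTION: variance-preserving schedule, i.e. alpha^2 + sigma^2 = 1 and
# log(alpha^2 / sigma^2) = logsnr. The source does not state the schedule.
def alpha_sigma(logsnr):
    """Return (alpha, sigma) for a given log signal-to-noise ratio."""
    alpha = np.sqrt(1.0 / (1.0 + np.exp(-logsnr)))  # sqrt(sigmoid(logsnr))
    sigma = np.sqrt(1.0 / (1.0 + np.exp(logsnr)))   # sqrt(sigmoid(-logsnr))
    return alpha, sigma

def noisy_latent(z_clean, logsnr_min_noise, rng):
    """Add a fixed amount of Gaussian noise to the encoder output.

    The noise level is the one at which the prior's noise is smallest, so the
    prior never has to model latents that are cleaner than what the encoder
    actually emits.
    """
    alpha, sigma = alpha_sigma(logsnr_min_noise)
    eps = rng.standard_normal(z_clean.shape)
    return alpha * z_clean + sigma * eps

# HYPOTHETICAL value: log-SNR at the prior's minimum noise level.
LOGSNR_AT_MIN_NOISE = 5.0

rng = np.random.default_rng(0)
z_clean = rng.standard_normal((4, 8))  # stand-in for a deterministic encoder output
z0 = noisy_latent(z_clean, LOGSNR_AT_MIN_NOISE, rng)
```

In this sketch, raising `LOGSNR_AT_MIN_NOISE` adds less noise to the latents and lets them carry more information, while lowering it does the opposite. This is one plausible way a small set of hyperparameters could control bitrate.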
