Unified Latents (UL): How to train your latents

Jonathan Heek, Emiel Hoogeboom, Thomas Mensink, Tim Salimans

TL;DR

Unified Latents (UL) jointly trains an encoder, a diffusion prior, and a diffusion decoder, so that the learned latents are regularized by the prior and decoded by a diffusion model. By tying the encoder’s forward-noise level to the prior’s minimum noise level, UL obtains a simple upper bound on latent information content and a two-stage training flow with a sigmoid-weighted decoder ELBO. Empirically, UL achieves a competitive $\text{FID}=1.4$ on ImageNet-512 with high PSNR and fewer training FLOPs than models trained on Stable Diffusion latents, and sets a new state-of-the-art $\text{FVD}=1.3$ on Kinetics-600; larger base models benefit from more informative latents, and the bitrate can be tuned via a small set of hyperparameters. Overall, UL offers a principled, scalable route to efficient latent diffusion by explicitly controlling latent information content through diffusion-based priors and decoders.

Abstract

We present Unified Latents (UL), a framework for learning latent representations that are jointly regularized by a diffusion prior and decoded by a diffusion model. By linking the encoder's output noise to the prior's minimum noise level, we obtain a simple training objective that provides a tight upper bound on the latent bitrate. On ImageNet-512, our approach achieves a competitive FID of 1.4 with high reconstruction quality (PSNR), while requiring fewer training FLOPs than models trained on Stable Diffusion latents. On Kinetics-600, we set a new state-of-the-art FVD of 1.3.

Paper Structure (34 sections, 5 equations, 10 figures, 6 tables, 2 algorithms)

Figures (10)

  • Figure 1: Schematic overview of our model, including the encoder ($E_\theta$), the prior latent diffusion model ($P_\theta$), and the diffusion decoder model ($D_\theta$).
  • Figure 2: Unified Latents overview. An image ${\bm{x}}$ is encoded to ${\bm{z}}_\mathrm{clean}$. A diffusion prior models the path from pure noise ${\bm{z}}_1$ to a slightly noisy latent ${\bm{z}}_0$. This ${\bm{z}}_0$ is then used by a diffusion decoder to reconstruct the image. The prior thus measures and regularizes the information content of ${\bm{z}}_0$.
  • Figure 3: Decoder weighting on ${\bm{{\epsilon}}}$-MSE, $w_{\bm{{\epsilon}}}(\lambda_t) = c_\mathrm{lf} \cdot \operatorname{sigmoid}(b - \lambda_t)$, showing which noise levels are penalized (via a loss factor $c_\mathrm{lf} = 1.6$ in this case) and which noise levels are discounted. In theory, for weightings above $1$ the latent model is preferred and for weightings below $1$ the decoder is preferred. In practice, the decoder will model information even if the weighting is slightly above $1$.
  • Figure 4: FID vs. training cost on ImageNet-512. UL outperforms all other approaches in the trade-off between base-model training compute and generation quality. We assume that one training iteration is three times as expensive as evaluating the model (i.e., forward pass, backprop to inputs, backprop to weights). Note that auto-encoder training cost is not included.
  • Figure 5: A selection of samples from a text-to-image model trained with Unified Latents.
  • ...and 5 more figures
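
The decoder weighting in the Figure 3 caption, $w_{\bm{{\epsilon}}}(\lambda_t) = c_\mathrm{lf} \cdot \operatorname{sigmoid}(b - \lambda_t)$, can be explored numerically with a minimal sketch. Several things here are assumptions rather than details from the paper: $\lambda_t$ is taken to be a log signal-to-noise ratio (the caption does not define it), the bias $b = 0$ is a placeholder because the caption gives no value, and the function names are hypothetical. Only $c_\mathrm{lf} = 1.6$ comes from the caption. The helper `crossover_log_snr` solves $w = 1$ for $\lambda$, which gives $\lambda^* = b + \log(c_\mathrm{lf} - 1)$, the point the caption describes as the theoretical boundary between the latent model and the decoder.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decoder_weight(lam: float, bias: float = 0.0, loss_factor: float = 1.6) -> float:
    """Weight on the epsilon-MSE: c_lf * sigmoid(b - lambda_t).

    `lam` is assumed to be a log-SNR; `bias` = 0.0 is a placeholder value,
    not taken from the paper. `loss_factor` = 1.6 matches the caption.
    """
    return loss_factor * sigmoid(bias - lam)


def crossover_log_snr(bias: float = 0.0, loss_factor: float = 1.6) -> float:
    """Value of lambda at which the weight equals exactly 1.

    Solving c_lf * sigmoid(b - lambda) = 1 gives lambda = b + log(c_lf - 1).
    A crossover only exists when c_lf > 1.
    """
    if loss_factor <= 1.0:
        raise ValueError("the weight never reaches 1 unless loss_factor > 1")
    return bias + math.log(loss_factor - 1.0)


if __name__ == "__main__":
    lam_star = crossover_log_snr()
    print(f"weight crosses 1 at lambda = {lam_star:.4f}")
    for lam in (-6.0, -2.0, lam_star, 2.0, 6.0):
        print(f"lambda = {lam:+.4f}  weight = {decoder_weight(lam):.4f}")
```

Under these assumptions the weight approaches $c_\mathrm{lf}$ as $\lambda_t \to -\infty$ and falls toward $0$ as $\lambda_t \to +\infty$, so weights above $1$ occur only below the crossover $\lambda^*$. Changing $b$ shifts the whole curve along the $\lambda_t$ axis without changing its shape.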