Boosting Latent Diffusion with Perceptual Objectives

Tariq Berrada, Pietro Astolfi, Melissa Hall, Marton Havasi, Yohann Benchetrit, Adriana Romero-Soriano, Karteek Alahari, Michal Drozdzal, Jakob Verbeek

TL;DR

The paper addresses the decoder–diffusion disconnect in latent diffusion models by introducing Latent Perceptual Loss (LPL), which leverages intermediate features of the autoencoder's decoder to guide training toward sharper, more realistic images. LPL is defined as a multiscale, decoder-feature distance between the decoded original and estimated latents, integrated with the standard diffusion objective as $\mathcal{L}_\text{tot} = \mathcal{L}_\text{Diff} + w_\text{LPL} \mathcal{L}_\text{LPL}$ and applicable to DDPM in epsilon/velocity modes as well as Flow-OT. Across ImageNet-1k, CC12M, and S320M at 256 and 512 resolutions, LPL consistently improves FID (roughly 0.5–1.5 points depending on setting) and CLIPScore, with qualitative benefits of finer textures and high-frequency details; ablations show deeper decoder layers and per-channel normalization are beneficial, while the approach incurs modest memory overhead. Overall, LPL provides a general, architecture-agnostic enhancement for latent generative models, improving perceptual quality without requiring specialized data or changes to the denoiser structure, and potentially influencing future latent-diffusion training paradigms.

Abstract

Latent diffusion models (LDMs) power state-of-the-art high-resolution generative image models. LDMs learn the data distribution in the latent space of an autoencoder (AE) and produce images by mapping the generated latents into RGB image space using the AE decoder. While this approach allows for efficient model training and sampling, it induces a disconnect between the training of the diffusion model and the decoder, resulting in a loss of detail in the generated images. To remediate this disconnect, we propose to leverage the internal features of the decoder to define a latent perceptual loss (LPL). This loss encourages the models to create sharper and more realistic images. Our loss can be seamlessly integrated with common autoencoders used in latent diffusion models, and can be applied to different generative modeling paradigms such as DDPM with epsilon and velocity prediction, as well as flow matching. Extensive experiments with models trained on three datasets at 256 and 512 resolution show improved quantitative -- with boosts between 6% and 20% in FID -- and qualitative results when using our perceptual loss.
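
To make the combined objective $\mathcal{L}_\text{tot} = \mathcal{L}_\text{Diff} + w_\text{LPL} \mathcal{L}_\text{LPL}$ concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The toy decoder (random linear layers with `tanh`), the function names, the use of a plain MSE on the clean-latent estimate as a stand-in for $\mathcal{L}_\text{Diff}$, and the choice to normalize both feature maps with the statistics of the clean features are all assumptions made for illustration; the outlier detection mentioned in the paper is omitted.

```python
import numpy as np

def decoder_features(z, weights):
    """Run a toy decoder and return the activations of every layer."""
    feats, h = [], z
    for w in weights:
        h = np.tanh(h @ w)  # hypothetical stand-in for a real decoder block
        feats.append(h)
    return feats

def latent_perceptual_loss(z0, z0_hat, weights, eps=1e-6):
    """Sum over layers of the MSE between per-channel-normalized features."""
    total = 0.0
    clean = decoder_features(z0, weights)
    pred = decoder_features(z0_hat, weights)
    for f, f_hat in zip(clean, pred):
        # Assumption: both maps share the clean features' per-channel statistics.
        mean = f.mean(axis=0, keepdims=True)
        std = f.std(axis=0, keepdims=True) + eps
        total += np.mean(((f - mean) / std - (f_hat - mean) / std) ** 2)
    return total

def total_loss(z0, z0_hat, weights, w_lpl=1.0):
    """L_tot = L_Diff + w_LPL * L_LPL, with L_Diff simplified to a latent MSE."""
    l_diff = np.mean((z0 - z0_hat) ** 2)
    return l_diff + w_lpl * latent_perceptual_loss(z0, z0_hat, weights)

# Toy usage: a batch of 8 latents with 4 channels and a 2-layer "decoder".
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 16))]
z0 = rng.normal(size=(8, 4))
z0_hat = z0 + 0.1 * rng.normal(size=(8, 4))  # imperfect clean-latent estimate
loss = total_loss(z0, z0_hat, weights, w_lpl=0.5)
```

In a real setup the decoder would be the frozen autoencoder decoder and `z0_hat` would come from the denoiser's prediction. The weight `w_lpl` controls how strongly the decoder-feature term influences training.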

Paper Structure

This paper contains 20 sections, 12 equations, 31 figures, 2 tables.

Figures (31)

  • Figure 1: Samples from models trained with and without our latent perceptual loss on CC12M. Samples from our model with latent perceptual loss (bottom) have more detail and realistic textures.
  • Figure 2: Overview of our approach. (a) Latent diffusion models compare clean latents and the predicted latents. (b) Our LPL acts on the features of the autoencoder's decoder, effectively aligning the diffusion process with the decoder. $F_\beta^e,F_\beta^d$: autoencoder encoder and decoder, $D_\Theta$: denoiser network, CN: cross normalization layer, OD: outlier detection.
  • Figure 3: Summary of the formula for the estimate of the clean image corresponding to the different formulations. Using the following parameterization, $\forall t, {\bf x}_t = \alpha_t {\bf x}_0 + \sigma_t \pmb{\epsilon}_t$.
  • Figure 4: Samples from models trained with and without our latent perceptual loss on S320M. Samples from the model with perceptual loss (bottom row) show more realistic textures and details.
  • Figure 5: Impact of our perceptual loss for models trained on different datasets and resolutions for the DDPM-$\epsilon$ model. All models use the same ImageNet-256 pretraining for $600k$ iterations before comparing the effect of LPL during post-training. Using LPL boosts FID and CLIP score for all datasets and resolutions considered.
  • ...and 26 more figures
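
The Figure 3 caption refers to clean-latent estimates under the parameterization ${\bf x}_t = \alpha_t {\bf x}_0 + \sigma_t \pmb{\epsilon}_t$, but the formulas themselves are not reproduced here. The sketch below uses the standard textbook conventions, which may differ from the paper's: epsilon prediction, velocity prediction with ${\bf v} = \alpha_t \pmb{\epsilon} - \sigma_t {\bf x}_0$ and $\alpha_t^2 + \sigma_t^2 = 1$, and a linear flow-matching path with $\alpha_t = 1 - t$, $\sigma_t = t$, and target $\pmb{\epsilon} - {\bf x}_0$. The function names are hypothetical.

```python
import numpy as np

def x0_from_eps(x_t, eps_hat, alpha_t, sigma_t):
    """Epsilon prediction: invert x_t = alpha_t * x0 + sigma_t * eps."""
    return (x_t - sigma_t * eps_hat) / alpha_t

def x0_from_v(x_t, v_hat, alpha_t, sigma_t):
    """Velocity prediction, assuming v = alpha_t * eps - sigma_t * x0
    and a variance-preserving schedule (alpha_t**2 + sigma_t**2 == 1)."""
    return alpha_t * x_t - sigma_t * v_hat

def x0_from_flow(x_t, u_hat, t):
    """Linear flow path, assuming x_t = (1 - t) * x0 + t * eps
    and a predicted vector field u = eps - x0."""
    return x_t - t * u_hat

# Sanity check with ground-truth targets in place of network predictions.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(2, 4))
eps = rng.normal(size=(2, 4))
t = 0.3

alpha_t, sigma_t = np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)
x_t = alpha_t * x0 + sigma_t * eps
v = alpha_t * eps - sigma_t * x0
rec_eps = x0_from_eps(x_t, eps, alpha_t, sigma_t)
rec_v = x0_from_v(x_t, v, alpha_t, sigma_t)

x_t_flow = (1 - t) * x0 + t * eps
rec_flow = x0_from_flow(x_t_flow, eps - x0, t)
```

With exact targets, all three functions recover `x0` up to floating-point error. During training the targets are replaced by the denoiser's outputs, and the resulting estimate is what a decoder-feature loss such as LPL would be applied to.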