Boosting Latent Diffusion with Perceptual Objectives
Tariq Berrada, Pietro Astolfi, Melissa Hall, Marton Havasi, Yohann Benchetrit, Adriana Romero-Soriano, Karteek Alahari, Michal Drozdzal, Jakob Verbeek
TL;DR
The paper addresses the decoder–diffusion disconnect in latent diffusion models by introducing the Latent Perceptual Loss (LPL), which uses intermediate features of the autoencoder's decoder to guide training toward sharper, more realistic images. LPL is a multiscale distance, computed on decoder features, between the decoded original latents and the decoded estimated latents. It is added to the standard diffusion objective as $\mathcal{L}_{\text{tot}} = \mathcal{L}_{\text{Diff}} + w_{\text{LPL}} \mathcal{L}_{\text{LPL}}$ and applies to DDPM with epsilon or velocity prediction as well as to Flow-OT. Across ImageNet-1k, CC12M, and S320M at 256 and 512 resolutions, LPL consistently improves FID (by roughly 0.5–1.5 points depending on the setting) and CLIPScore, and qualitatively yields finer textures and high-frequency details. Ablations show that using deeper decoder layers and per-channel feature normalization is beneficial, while the approach incurs a modest memory overhead. Overall, LPL is a general, architecture-agnostic enhancement for latent generative models: it improves perceptual quality without requiring specialized data or changes to the denoiser structure.
Abstract
Latent diffusion models (LDMs) power state-of-the-art high-resolution generative image models. LDMs learn the data distribution in the latent space of an autoencoder (AE) and produce images by mapping the generated latents into RGB image space using the AE decoder. While this approach allows for efficient model training and sampling, it induces a disconnect between the training of the diffusion model and the decoder, resulting in a loss of detail in the generated images. To remediate this disconnect, we propose to leverage the internal features of the decoder to define a latent perceptual loss (LPL). This loss encourages the models to create sharper and more realistic images. Our loss can be seamlessly integrated with common autoencoders used in latent diffusion models, and can be applied to different generative modeling paradigms such as DDPM with epsilon and velocity prediction, as well as flow matching. Extensive experiments with models trained on three datasets at 256 and 512 resolution show improved quantitative results, with boosts between 6% and 20% in FID, as well as improved qualitative results when using our perceptual loss.
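To make the structure of the objective concrete, the following is a minimal pure-Python sketch of a decoder-feature distance summed over several layers and added to a diffusion loss with a weight. It is an illustration under stated assumptions, not the authors' implementation: the function names, the nested-list feature layout, the choice to normalize both feature maps with the reference map's per-channel statistics, and the toy numbers are all hypothetical. In practice the features would be tensors taken from intermediate decoder layers for the original and the estimated latents.

```python
from statistics import fmean, pstdev


def channel_stats(values):
    """Return the mean and (population) standard deviation of one flattened channel."""
    return fmean(values), pstdev(values)


def feature_distance(ref_layer, est_layer, eps=1e-6):
    """Mean squared distance between two feature maps after per-channel normalization.

    Each layer is a list of channels; each channel is a flat list of floats.
    Assumption: both maps are normalized with the reference channel's statistics.
    """
    per_channel = []
    for ref_ch, est_ch in zip(ref_layer, est_layer):
        mean, std = channel_stats(ref_ch)
        ref_n = [(v - mean) / (std + eps) for v in ref_ch]
        est_n = [(v - mean) / (std + eps) for v in est_ch]
        per_channel.append(fmean([(r - e) ** 2 for r, e in zip(ref_n, est_n)]))
    return fmean(per_channel)


def latent_perceptual_loss(ref_features, est_features):
    """Sum the normalized feature distances over the selected decoder layers."""
    return sum(feature_distance(r, e) for r, e in zip(ref_features, est_features))


def total_loss(diffusion_loss, lpl, w_lpl):
    """Combine the two terms: L_tot = L_Diff + w_LPL * L_LPL."""
    return diffusion_loss + w_lpl * lpl


# Toy decoder features for two layers (hypothetical values, not real activations).
ref_features = [
    [[0.0, 1.0, 2.0, 3.0]],          # layer 1: one channel, four positions
    [[1.0, 1.0], [0.0, 2.0]],        # layer 2: two channels, two positions
]
est_features = [
    [[0.0, 1.0, 2.0, 4.0]],          # differs from the reference at one position
    [[1.0, 1.0], [0.0, 2.0]],        # identical to the reference
]

lpl_value = latent_perceptual_loss(ref_features, est_features)
combined = total_loss(diffusion_loss=0.5, lpl=lpl_value, w_lpl=3.0)
```

The perceptual term is zero when the estimated features match the reference features, so in that case the combined objective reduces to the diffusion loss alone. The weight `w_lpl` controls how strongly feature mismatches are penalized relative to the latent-space loss.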
