Diffusion Model with Perceptual Loss

Shanchuan Lin, Xiao Yang

TL;DR

This work challenges the view that diffusion model quality under guidance stems primarily from sampling temperature, arguing instead that the loss objective fundamentally shapes the learned distribution. By introducing a self-perceptual objective that uses a frozen MSE-trained diffusion model as the perceptual network, the authors achieve substantially more realistic samples without guidance. Quantitative gains in FID and IS demonstrate the effectiveness of perceptual supervision, though classifier-free guidance remains a strong baseline when paired with MSE. The study highlights the potential of perceptual losses in diffusion training and points to future directions for learned loss objectives and adversarial-style perceptual feedback.

Abstract

Diffusion models without guidance generate very unrealistic samples. Guidance is used ubiquitously, and previous research has attributed its effect to low-temperature sampling that improves quality by trading off diversity. However, this perspective is incomplete. Our research shows that the choice of the loss objective is the underlying reason raw diffusion models fail to generate desirable samples. In this paper, (1) our analysis shows that the loss objective plays an important role in shaping the learned distribution and the MSE loss derived from theories holds assumptions that misalign with data in practice; (2) we explain the effectiveness of guidance methods from a new perspective of perceptual supervision; (3) we validate our hypothesis by training a diffusion model with a novel self-perceptual loss objective and obtaining much more realistic samples without the need for guidance. We hope our work paves the way for future explorations of the diffusion loss objective.
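As a rough illustration of what "perceptual supervision" means in general, the sketch below compares a prediction and a target in the feature space of a frozen network instead of in raw space. It is a toy under stated assumptions, not the paper's self-perceptual objective: the one-layer `frozen_features` network, the `FROZEN_WEIGHTS` values, and the sample vectors are all hypothetical, and this summary does not specify how the frozen MSE-trained diffusion model is actually queried.

```python
import math

# Hypothetical frozen weights standing in for a pretrained network (assumption).
FROZEN_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]


def frozen_features(x, weights):
    """Toy frozen 'perceptual network': one fixed linear layer followed by tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def perceptual_loss(prediction, target, weights):
    """Feature-space MSE: both inputs pass through the same frozen network."""
    return mse(frozen_features(prediction, weights),
               frozen_features(target, weights))


target = [1.0, 0.0]       # illustrative ground-truth vector (assumption)
prediction = [0.8, 0.1]   # illustrative model output (assumption)

raw_loss = mse(prediction, target)
feature_loss = perceptual_loss(prediction, target, FROZEN_WEIGHTS)
print("raw-space MSE:", raw_loss)
print("feature-space MSE:", feature_loss)
```

The two losses rank errors differently because the frozen network decides which differences matter; only the network producing `prediction` would receive gradients in a real training loop.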

Paper Structure (21 sections, 7 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: The diffusion model trained with MSE loss generates unrealistic samples without guidance (top row). Our proposed self-perceptual loss can generate realistic samples without guidance. The loss objective is important in shaping the learned distribution.
  • Figure 2: A motivating example where data samples are given and the actual distribution is unknown. Diffusion models learn the maximum likelihood estimation (MLE) distribution as the target. Neural networks create smoothness and generalization. The loss objective influences the shape of the learned distribution and can be designed with inductive biases to better drive it toward the actual distribution.
  • Figure 3: The midpoint sample is derived by minimizing the distance to known samples by the given distance function. MSE midpoint is out-of-distribution.
  • Figure 4: Text-to-image generation on DrawBench prompts (Saharia et al., 2022). Our self-perceptual objective improves sample quality over the MSE objective while largely maintaining the image content and layout. Classifier-free guidance has the additional effect of enhancing text alignment by sacrificing sample diversity. Images are generated with DDIM 50 NFEs. More analysis in the qualitative evaluation section.
  • Figure 5: Text-to-image generation on DrawBench prompts (Saharia et al., 2022). Our self-perceptual objective improves sample quality over the vanilla MSE objective while largely maintaining the image content and layout. Classifier-free guidance has the additional effect of enhancing text alignment by sacrificing sample diversity. Images are generated with DDIM 50 NFEs. More analysis in the qualitative evaluation section.
  • ...and 2 more figures
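
The Figure 3 caption's point, that a midpoint chosen by minimizing squared Euclidean distance can fall outside the data distribution, can be checked on a toy case. The sketch below assumes data that lies on the unit circle, which is an illustrative choice and not the paper's setup; the function names and the two sample points are hypothetical. The MSE midpoint is the arithmetic mean, which leaves the circle, while a midpoint under arc-length distance stays on it.

```python
import math


def mse_midpoint(a, b):
    """Minimizer of the summed squared Euclidean distance to a and b:
    the arithmetic mean of the two points."""
    return tuple((ai + bi) / 2 for ai, bi in zip(a, b))


def angular_midpoint(a, b):
    """Midpoint under arc-length distance on the unit circle, found by
    averaging angles. Valid here because the two angles do not wrap
    across +/- pi (a simplifying assumption of this sketch)."""
    theta = (math.atan2(a[1], a[0]) + math.atan2(b[1], b[0])) / 2
    return (math.cos(theta), math.sin(theta))


# Two known samples on the unit circle (illustrative assumption).
a = (1.0, 0.0)
b = (0.0, 1.0)

mse_mid = mse_midpoint(a, b)
ang_mid = angular_midpoint(a, b)

# Distance from the origin tells us whether a point is still on the circle.
mse_radius = math.hypot(*mse_mid)
ang_radius = math.hypot(*ang_mid)
print("MSE midpoint:", mse_mid, "radius:", mse_radius)
print("angular midpoint:", ang_mid, "radius:", ang_radius)
```

The MSE midpoint has radius of about 0.707, so it sits inside the circle where no data exists, whereas the angular midpoint keeps radius 1. The distance function therefore determines whether interpolated points stay on the data manifold.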