Diffusion Model with Perceptual Loss
Shanchuan Lin, Xiao Yang
TL;DR
This work challenges the view that diffusion model quality under guidance stems primarily from sampling temperature, arguing instead that the loss objective fundamentally shapes the learned distribution. By introducing a self-perceptual objective that uses a frozen MSE-trained diffusion model as the perceptual network, the authors achieve substantially more realistic samples without guidance. Quantitative gains in FID and IS demonstrate the effectiveness of perceptual supervision, though classifier-free guidance remains a strong baseline when paired with MSE. The study highlights the potential of perceptual losses in diffusion training and points to future directions for learned loss objectives and adversarial-style perceptual feedback.
Abstract
Diffusion models without guidance generate very unrealistic samples. Guidance is used ubiquitously, and previous research has attributed its effect to low-temperature sampling, which improves quality at the cost of diversity. However, this perspective is incomplete. Our research shows that the choice of loss objective is the underlying reason raw diffusion models fail to generate desirable samples. In this paper, (1) we show through analysis that the loss objective plays an important role in shaping the learned distribution, and that the theoretically derived MSE loss rests on assumptions that do not match real data; (2) we explain the effectiveness of guidance methods from a new perspective of perceptual supervision; (3) we validate our hypothesis by training a diffusion model with a novel self-perceptual loss objective, obtaining much more realistic samples without the need for guidance. We hope our work paves the way for future explorations of the diffusion loss objective.
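The abstract does not spell out how the self-perceptual loss is computed. As a rough illustration only, the toy NumPy sketch below contrasts a plain MSE loss with a loss measured in the feature space of a frozen copy of the network, following the TL;DR's description of a frozen MSE-trained diffusion model acting as the perceptual network. Everything in it is an assumption made for illustration: the two-layer network, the choice of hidden activations as features, the re-noising of both inputs with shared noise, and all names (`make_toy_network`, `hidden_features`, `self_perceptual_loss`, `alpha`) are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_toy_network(dim, hidden, rng):
    """Hypothetical stand-in for a diffusion network: one hidden layer."""
    return {
        "w1": rng.normal(scale=0.5, size=(dim, hidden)),
        "w2": rng.normal(scale=0.5, size=(hidden, dim)),
    }

def hidden_features(params, x):
    """Intermediate activations, used here as the 'perceptual' features."""
    return np.tanh(x @ params["w1"])

def predict(params, x):
    """Network output, treated here as a prediction of the clean sample."""
    return hidden_features(params, x) @ params["w2"]

def mse_loss(pred, target):
    """Plain mean squared error in sample space."""
    return float(np.mean((pred - target) ** 2))

def self_perceptual_loss(frozen, pred_clean, true_clean, noise, alpha):
    """MSE between frozen-network features of the prediction and the target.

    Assumption of this sketch: both inputs are re-noised with the same noise
    and the same mixing weight alpha, so any feature gap is caused by the
    prediction error alone.
    """
    noisy_pred = np.sqrt(alpha) * pred_clean + np.sqrt(1.0 - alpha) * noise
    noisy_true = np.sqrt(alpha) * true_clean + np.sqrt(1.0 - alpha) * noise
    return mse_loss(hidden_features(frozen, noisy_pred),
                    hidden_features(frozen, noisy_true))

# Toy usage. The "frozen" network stands in for an MSE-pretrained model.
dim, hidden, batch = 8, 16, 4
frozen = make_toy_network(dim, hidden, rng)
trainable = {k: v.copy() for k, v in frozen.items()}  # starts from the frozen weights

x0 = rng.normal(size=(batch, dim))                    # clean samples
eps = rng.normal(size=(batch, dim))                   # forward-process noise
alpha_t = 0.7                                         # hypothetical signal level
x_t = np.sqrt(alpha_t) * x0 + np.sqrt(1.0 - alpha_t) * eps

x0_pred = predict(trainable, x_t)                     # trainable model's estimate
eps2 = rng.normal(size=(batch, dim))                  # noise shared by both branches

plain = mse_loss(x0_pred, x0)
perceptual = self_perceptual_loss(frozen, x0_pred, x0, eps2, alpha=0.5)
zero_case = self_perceptual_loss(frozen, x0, x0, eps2, alpha=0.5)
print("MSE:", plain, "self-perceptual:", perceptual, "perfect prediction:", zero_case)
```

In a real training loop only the trainable network would receive gradients, while the frozen network would serve purely as a fixed feature extractor. A perfect prediction gives a loss of exactly zero, since both branches then see identical inputs.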
