What happens to diffusion model likelihood when your model is conditional?
Mattias Cross, Anton Ragni
TL;DR
The paper investigates how exact diffusion-model likelihoods behave under conditioning in Text-to-Speech and Text-to-Image tasks. Using probability-flow ODEs and divergence-based likelihoods, it analyzes Grad-TTS and SDXL to quantify how conditioning signals influence likelihoods. The findings show that higher likelihood along the denoising path does not guarantee better intelligibility, semantic alignment, or prompt-faithful generation, highlighting a mismatch between $\log p_0(\mathbf{X}_0)$ and conditioning quality. This work cautions practitioners against a straightforward use of conditional diffusion likelihoods as quality proxies and motivates further theoretical and empirical study of likelihood in conditional diffusion models.
Abstract
Diffusion Models (DMs) iteratively denoise random samples to produce high-quality data. The iterative sampling process is derived from Stochastic Differential Equations (SDEs), which allows the speed-quality trade-off to be chosen at inference time. Another advantage of sampling with differential equations is exact likelihood computation. These likelihoods have been used to rank unconditional DMs and for out-of-domain classification. Despite the many existing and possible uses of DM likelihoods, the properties they capture are unknown, especially in conditional contexts such as Text-To-Image (TTI) generation or Text-To-Speech (TTS) synthesis. Surprisingly, we find that TTS DM likelihoods are agnostic to the text input. TTI likelihood is more expressive but cannot discern confounding prompts. Our results show that applying DMs to conditional tasks reveals inconsistencies and strengthens claims that the properties of DM likelihood are unknown. These findings shed light on the poorly understood nature of DM likelihoods. Although conditional DMs maximise likelihood, the likelihood in question is not as sensitive to the conditioning input as one might expect. This investigation provides a new point of view on diffusion likelihoods.
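To make "exact likelihood computation" concrete, the sketch below evaluates the standard probability-flow ODE identity $\log p_0(\mathbf{X}_0) = \log p_T(\mathbf{X}_T) + \int_0^T \nabla \cdot \tilde{f}(\mathbf{X}_t, t)\,dt$ on a toy problem. It is an illustration under stated assumptions, not the Grad-TTS or SDXL setup studied in the paper. The data are one-dimensional and Gaussian, the forward process is a variance-preserving SDE with constant noise rate, the score is known in closed form instead of being learned, and there is no conditioning input. The names `pf_ode_log_likelihood`, `data_var`, `beta`, `T`, and `steps` are hypothetical.

```python
import math


def log_normal(x, var):
    """Log-density of a zero-mean 1-D Gaussian with variance `var`."""
    return -0.5 * (math.log(2.0 * math.pi * var) + x * x / var)


def pf_ode_log_likelihood(x0, data_var=0.25, beta=1.0, T=1.0, steps=1000):
    """Toy likelihood via the probability-flow ODE (assumed setup).

    Assumptions: data ~ N(0, data_var); forward VP-SDE
    dx = -0.5*beta*x dt + sqrt(beta) dW, so the marginal at time t is
    N(0, var_t) and the score is -x / var_t in closed form.
    """

    def var_t(t):
        a = math.exp(-beta * t)
        return data_var * a + (1.0 - a)

    def c(t):
        # PF-ODE drift is f~(x, t) = c(t) * x, so its divergence is c(t).
        return 0.5 * beta * (1.0 / var_t(t) - 1.0)

    h = T / steps
    x, div_integral = x0, 0.0
    for i in range(steps):
        t = i * h
        c0, cm, c1 = c(t), c(t + 0.5 * h), c(t + h)
        # Classical RK4 step for dx/dt = c(t) * x.
        k1 = c0 * x
        k2 = cm * (x + 0.5 * h * k1)
        k3 = cm * (x + 0.5 * h * k2)
        k4 = c1 * (x + h * k3)
        x += h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        # Simpson's rule for the time integral of the divergence.
        div_integral += h / 6.0 * (c0 + 4.0 * cm + c1)

    # Terminal density is the exact marginal N(0, var_T) in this toy case;
    # a trained model would typically use a standard normal prior instead.
    return log_normal(x, var_t(T)) + div_integral


if __name__ == "__main__":
    x0 = 0.3
    print("PF-ODE  :", pf_ode_log_likelihood(x0))
    print("analytic:", log_normal(x0, 0.25))
```

Because everything in the toy case is Gaussian, the ODE-based value can be checked against the closed-form density, which is what the two printed lines compare. With a learned, conditional score network the same integral would be evaluated numerically along the denoising path, with the divergence usually estimated, not computed in closed form.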
