What happens to diffusion model likelihood when your model is conditional?

Mattias Cross, Anton Ragni

TL;DR

The paper investigates how exact diffusion-model likelihoods behave under conditioning in Text-to-Speech and Text-to-Image tasks. Using probability-flow ODEs and divergence-based likelihoods, it analyzes Grad-TTS and SDXL to quantify how conditioning signals influence likelihoods. The findings show that higher likelihood along the denoising path does not guarantee better intelligibility, semantic alignment, or prompt-faithful generation, highlighting a mismatch between $\log p_0(\mathbf{X}_0)$ and conditioning quality. This work cautions practitioners against a straightforward use of conditional diffusion likelihoods as quality proxies and motivates further theoretical and empirical study of likelihood in conditional diffusion models.

Abstract

Diffusion Models (DMs) iteratively denoise random samples to produce high-quality data. The iterative sampling process is derived from Stochastic Differential Equations (SDEs), allowing a speed-quality trade-off chosen at inference. Another advantage of sampling with differential equations is exact likelihood computation. These likelihoods have been used to rank unconditional DMs and for out-of-domain classification. Despite the many existing and possible uses of DM likelihoods, the distinct properties captured are unknown, especially in conditional contexts such as Text-To-Image (TTI) or Text-To-Speech synthesis (TTS). Surprisingly, we find that TTS DM likelihoods are agnostic to the text input. TTI likelihood is more expressive but cannot discern confounding prompts. Our results show that applying DMs to conditional tasks reveals inconsistencies and strengthens claims that the properties of DM likelihood are unknown. This impact sheds light on the previously unknown nature of DM likelihoods. Although conditional DMs maximise likelihood, the likelihood in question is not as sensitive to the conditioning input as one expects. This investigation provides a new point-of-view on diffusion likelihoods.
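To make the idea of a divergence-based likelihood concrete, the sketch below works through a one-dimensional toy case. It is not the Grad-TTS or SDXL setup from the paper. It assumes a hypothetical conditional data distribution $p_0(x \mid c) = \mathcal{N}(c, s_0^2)$ and a forward SDE $dx = dW$, so that $p_t(x \mid c) = \mathcal{N}(c, s_0^2 + t)$ and the score is known in closed form. The probability-flow ODE is integrated from $t = 0$ to $t = T$, the divergence of its drift is accumulated, and the result is combined with the terminal log-density via $\log p_0(x_0) = \log p_T(x_T) + \int_0^T \nabla \cdot \tilde{f}(x_t, t)\, dt$. All function names and parameter values are illustrative assumptions.

```python
import math


def score(x, t, c, s0=1.0):
    """Analytic score of the assumed marginal p_t(x | c) = N(c, s0^2 + t)."""
    return -(x - c) / (s0 ** 2 + t)


def pf_drift(x, t, c, s0=1.0):
    """Probability-flow ODE drift for the toy SDE dx = dW (f = 0, g = 1)."""
    return -0.5 * score(x, t, c, s0)


def pf_divergence(t, s0=1.0):
    """Derivative of the drift w.r.t. x; in 1-D this is the divergence."""
    return 0.5 / (s0 ** 2 + t)


def log_likelihood(x0, c, T=4.0, steps=1000, s0=1.0):
    """Estimate log p_0(x0 | c) by integrating the probability-flow ODE."""
    h = T / steps
    x, t, div_integral = x0, 0.0, 0.0
    for _ in range(steps):
        # Classical RK4 step for the state.
        k1 = pf_drift(x, t, c, s0)
        k2 = pf_drift(x + 0.5 * h * k1, t + 0.5 * h, c, s0)
        k3 = pf_drift(x + 0.5 * h * k2, t + 0.5 * h, c, s0)
        k4 = pf_drift(x + h * k3, t + h, c, s0)
        x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        # Simpson's rule for the divergence integral (state-independent here).
        div_integral += h * (
            pf_divergence(t, s0)
            + 4 * pf_divergence(t + 0.5 * h, s0)
            + pf_divergence(t + h, s0)
        ) / 6
        t += h
    # Terminal log-density under the assumed p_T(x | c) = N(c, s0^2 + T).
    var_T = s0 ** 2 + T
    log_pT = -0.5 * (x - c) ** 2 / var_T - 0.5 * math.log(2 * math.pi * var_T)
    return log_pT + div_integral


def exact_log_density(x0, c, s0=1.0):
    """Closed-form log N(x0; c, s0^2), used only to check the ODE estimate."""
    return -0.5 * (x0 - c) ** 2 / s0 ** 2 - 0.5 * math.log(2 * math.pi * s0 ** 2)


if __name__ == "__main__":
    x0 = 0.5
    for c in (0.5, 3.0):
        print(c, log_likelihood(x0, c), exact_log_density(x0, c))
```

In this toy the likelihood responds to the conditioning value $c$ by construction, because the analytic score depends on it. With a learned score network the same computation would typically replace the closed-form divergence with a numerical or stochastic estimate, and the paper's question is how strongly the resulting likelihood actually depends on the conditioning input.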

Paper Structure (13 sections, 13 equations, 3 figures, 7 tables)

Figures (3)

  • Figure 1: Example data from CLEVR ($a$) and PACS ($b$)
  • Figure 2: Grad-TTS encoder output ($a$) and decoder output ($b$). The encoder produces intelligible spectrograms, and the decoder removes distortion and encourages speaker characteristics. The fact that the encoder output is similar to distorted spectrograms is core to the unsupervised domain adaptation method in Table \ref{tab:uda} where blurry spectrograms are treated as input to the diffusion decoder.
  • Figure 3: A source image ($a$) and a reconstructed image ($b$), with caption "panda eating cake". The panda and orientation are preserved but the domain has changed. The cake has been absorbed.