Maximum Likelihood Training of Score-Based Diffusion Models

Yang Song, Conor Durkan, Iain Murray, Stefano Ermon

TL;DR

This work improves the likelihoods of score-based diffusion models by deriving a likelihood-weighted objective that upper-bounds the negative log-likelihood, which enables approximate maximum likelihood training at a cost comparable to score matching. It connects diffusion processes to continuous normalizing flows, provides both KL-based and per-datapoint bounds, and uses variance reduction (importance sampling) and variational dequantization to improve likelihoods further. Empirically, likelihood weighting combined with importance sampling and variational dequantization yields consistent likelihood improvements across SDE families on CIFAR-10 and ImageNet 32×32, reaching 2.83 and 3.76 bits/dim respectively while keeping sample quality competitive. The results position score-based diffusion models as competitive with normalizing flows for tractable likelihood, while noting limitations such as slower sampling and uncertain transfer to discrete data.

Abstract

Score-based diffusion models synthesize samples by reversing a stochastic process that diffuses data to noise, and are trained by minimizing a weighted combination of score matching losses. The log-likelihood of score-based diffusion models can be tractably computed through a connection to continuous normalizing flows, but log-likelihood is not directly optimized by the weighted combination of score matching losses. We show that for a specific weighting scheme, the objective upper bounds the negative log-likelihood, thus enabling approximate maximum likelihood training of score-based diffusion models. We empirically observe that maximum likelihood training consistently improves the likelihood of score-based diffusion models across multiple datasets, stochastic processes, and model architectures. Our best models achieve negative log-likelihoods of 2.83 and 3.76 bits/dim on CIFAR-10 and ImageNet 32x32 without any data augmentation, on a par with state-of-the-art autoregressive models on these tasks.
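
To make the "specific weighting scheme" concrete, the following is a minimal NumPy sketch of a likelihood-weighted denoising score matching loss, not the authors' implementation. It assumes the weighting is $\lambda(t) = g(t)^2$, a VP SDE with a linear schedule $\beta(t)$ (so $g(t)^2 = \beta(t)$), uniform sampling of $t$, and a hypothetical `score_fn(x, t)` callable standing in for the score model. The constants `BETA_MIN`, `BETA_MAX`, and `eps` are illustrative assumptions, and the importance sampling variance reduction is deliberately left out.

```python
import numpy as np

# Assumed linear VP schedule; values are illustrative, not taken from this page.
BETA_MIN, BETA_MAX, T = 0.1, 20.0, 1.0


def beta(t):
    """beta(t) of the assumed linear schedule; for the VP SDE, g(t)^2 = beta(t)."""
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)


def vp_mean_std(t):
    """Mean coefficient and std of x(t) given x(0) under the VP SDE."""
    integral = BETA_MIN * t + 0.5 * t**2 * (BETA_MAX - BETA_MIN)
    return np.exp(-0.5 * integral), np.sqrt(1.0 - np.exp(-integral))


def likelihood_weighted_dsm_loss(score_fn, x0, rng, eps=1e-3):
    """Monte Carlo estimate of
        1/2 * int_eps^T g(t)^2 E|| s(x(t), t) - grad log p(x(t) | x(0)) ||^2 dt
    with t drawn uniformly from [eps, T] (no importance sampling).

    score_fn: hypothetical callable (x, t) -> array with the same shape as x.
    x0: data batch of shape (n, d).
    """
    n = x0.shape[0]
    t = rng.uniform(eps, T, size=(n, 1))
    z = rng.standard_normal(x0.shape)
    mean_coef, std = vp_mean_std(t)
    xt = mean_coef * x0 + std * z          # sample from the perturbation kernel
    target = -z / std                      # score of the Gaussian kernel at xt
    sq_err = np.sum((score_fn(xt, t) - target) ** 2, axis=1)
    weight = beta(t[:, 0])                 # likelihood weighting: g(t)^2
    return float(np.mean(0.5 * (T - eps) * weight * sq_err))
```

As a sanity check under these assumptions, if `x0` is drawn from a standard normal, every marginal of the VP SDE stays standard normal, so `lambda x, t: -x` is the exact score and should give a lower loss than a mismatched model such as `lambda x, t: np.zeros_like(x)`.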

Paper Structure

This paper contains 27 sections, 11 theorems, 63 equations, 4 figures, 3 tables.

Key Result

Theorem 1

Let $p({\mathbf{x}})$ be the data distribution, $\pi({\mathbf{x}})$ be a known prior distribution, and $p_{\bm{\theta}}^{\textnormal{SDE}}$ be defined as in sec:likelihood. Suppose $\{{\mathbf{x}}(t)\}_{t\in[0,T]}$ is a stochastic process defined by the SDE in eq:sde with ${\mathbf{x}}(0) \sim p$, w

Figures (4)

  • Figure 1: We can use an SDE to diffuse data to a simple noise distribution. This SDE can be reversed once we know the score of the marginal distribution at each intermediate time step, $\nabla_{\mathbf{x}} \log p_t({\mathbf{x}})$.
  • Figure 2: Learning curves with the likelihood weighting on the CIFAR-10 dataset (smoothed with exponential moving average). Importance sampling significantly reduces the loss variance.
  • Figure 3: Samples on CIFAR-10. (a) Model with the best FID. (b) ScoreFlow trained with likelihood weighting + importance sampling + VP SDE. Samples of both models are generated with the same random seed.
  • Figure 4: Samples on ImageNet $32\times 32$. (a) Model with the best FID. (b) ScoreFlow trained with likelihood weighting + importance sampling + VP SDE. Samples of both models are generated with the same random seed.

Theorems & Definitions (21)

  • Theorem 1
  • proof : Sketch of proof
  • Corollary 1
  • Theorem 2
  • proof : Sketch of proof
  • Theorem 3
  • proof : Sketch of proof
  • Theorem 3
  • proof
  • Theorem 3
  • ...and 11 more