Improved Techniques for Maximum Likelihood Estimation for Diffusion ODEs

Kaiwen Zheng; Cheng Lu; Jianfei Chen; Jun Zhu

TL;DR

This paper targets likelihood estimation for diffusion ODEs, a family of continuous normalizing flows that admit exact likelihood yet historically lag behind variational methods. It introduces i-DODE, a suite of techniques spanning training (velocity parameterization, high-order flow matching, log-SNR timing, and variance reduction) and evaluation (training-free truncated-normal dequantization with importance sampling) to close the gap. The key contributions include a training-free dequantization that aligns training and testing distributions, a velocity-based flow-matching framework with a second-order regularizer, and an IS strategy that accelerates convergence. Empirically, i-DODE achieves state-of-the-art likelihood on CIFAR-10 and ImageNet-32 without variational dequantization or augmentation (e.g., 2.56 BPD on CIFAR-10 and 3.43/3.69 BPD on ImageNet-32), with further gains when data augmentation is applied, and reports faster convergence and smoother trajectories. Overall, the work provides practical, scalable improvements for density estimation with diffusion ODEs and advances their competitiveness among likelihood-based generative models.

Abstract

Diffusion models have exhibited excellent performance in various domains. The probability flow ordinary differential equation (ODE) of diffusion models (i.e., diffusion ODEs) is a particular case of continuous normalizing flows (CNFs), which enables deterministic inference and exact likelihood evaluation. However, the likelihood estimation results of diffusion ODEs are still far from those of the state-of-the-art likelihood-based generative models. In this work, we propose several improved techniques for maximum likelihood estimation for diffusion ODEs, covering both training and evaluation. For training, we propose velocity parameterization and explore variance reduction techniques for faster convergence. We also derive an error-bounded high-order flow matching objective for finetuning, which improves the ODE likelihood and smooths its trajectory. For evaluation, we propose a novel training-free truncated-normal dequantization to fill the training-evaluation gap commonly existing in diffusion ODEs. Building upon these techniques, we achieve state-of-the-art likelihood estimation results on image datasets (2.56 BPD on CIFAR-10, 3.43/3.69 BPD on ImageNet-32) without variational dequantization or data augmentation, and 2.42 BPD on CIFAR-10 with data augmentation. Code is available at https://github.com/thu-ml/i-DODE.
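The likelihood figures above are reported in bits per dimension (BPD). As a small illustration of how such a number relates to a per-image negative log-likelihood in nats, here is a minimal Python sketch. The function names are hypothetical, and the only outside fact used is that a CIFAR-10 image has 3 x 32 x 32 = 3072 dimensions; this is not code from the i-DODE repository.

```python
import math

# Hypothetical helpers for illustration only.
CIFAR10_DIMS = 3 * 32 * 32  # channels x height x width

def nats_to_bpd(nll_nats, num_dims):
    """Convert a per-image negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2.0))

def bpd_to_nats(bpd, num_dims):
    """Inverse conversion: bits per dimension to per-image nats."""
    return bpd * num_dims * math.log(2.0)

# Per-image NLL (in nats) implied by the reported 2.56 BPD on CIFAR-10.
nll_cifar10 = bpd_to_nats(2.56, CIFAR10_DIMS)
```

Because the conversion divides by the number of dimensions, BPD values can be compared across image resolutions, whereas raw per-image NLL values cannot.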

Paper Structure (47 sections, 5 theorems, 101 equations, 11 figures, 4 tables, 1 algorithm)

Introduction
Diffusion Models
Diffusion ODEs and Maximum Likelihood Training
Log-SNR Timed Diffusion Models
Dequantization for Density Estimation
Diffusion ODEs with Truncated-Normal Dequantization
Challenges for Diffusion ODEs with Dequantization
Truncation introduces an additional gap.
Uniform dequantization causes a train-test mismatch.
Training-Free Dequantization by Truncated Normal
Practical Techniques for Improving the Likelihood of Diffusion ODEs
Velocity Parameterization
Error-bounded Second-Order Flow Matching
Timing by Log-SNR and Normalizing Velocity
Variance Reduction with Importance Sampling
...and 32 more sections

Key Result

Theorem 3.1

Suppose we use the truncated-normal dequantization in Eqn. (eqn:tn_dequant). Then the discrete model distribution admits a variational bound (stated in full as Theorem 3.1 of the paper).
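To make the idea of truncated-normal dequantization more concrete, the following minimal Python sketch draws noise from a standard normal restricted to [-tau, tau] by inverse-CDF sampling and uses it to perturb a discrete value. Everything here is an assumption made for illustration: the names sample_truncated_normal and dequantize, the parameters alpha, sigma and bin_half_width, and in particular the choice tau = alpha * bin_half_width / sigma, which is picked only so that the perturbed point stays inside the scaled quantization bin. The formula used in the paper is not reproduced in this summary and may differ.

```python
import random
from statistics import NormalDist

def sample_truncated_normal(tau, rng):
    """Draw from N(0, 1) restricted to [-tau, tau] via the inverse CDF.

    Assumes a moderate tau, so the CDF values stay strictly inside (0, 1).
    """
    std = NormalDist()
    cdf_lo, cdf_hi = std.cdf(-tau), std.cdf(tau)
    u = cdf_lo + (cdf_hi - cdf_lo) * rng.random()
    return std.inv_cdf(u)

def dequantize(x0, alpha, sigma, bin_half_width, rng):
    """Return alpha * x0 + sigma * eps with eps truncated normal.

    Assumption (not from the source): tau is set so that the result,
    divided by alpha, stays within x0 +/- bin_half_width.
    """
    tau = alpha * bin_half_width / sigma
    return alpha * x0 + sigma * sample_truncated_normal(tau, rng)

rng = random.Random(0)
# Hypothetical values: data scaled to [-1, 1] with 256 bins, small noise level.
x_hat = dequantize(x0=0.5, alpha=0.999, sigma=0.01, bin_half_width=1.0 / 256, rng=rng)
```

Under these assumptions the dequantized sample has the same functional form as a noised training sample at a small noise level, which is the kind of train-test alignment the outline attributes to this technique.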

Figures (11)

Figure 1: Test loss curve in the pretraining phase, compared to VDM (Kingma et al., 2021). We compute the loss on the test set by the SDE likelihood bound of Kingma et al. (2021).
Figure 2: Training curve from scratch for ablation. We compute the loss on the training set by the SDE likelihood bound of Kingma et al. (2021).
Figure 3: Visualization of importance sampling. (a) The inverse cumulative distribution function $\gamma(t)$ of the proposal distribution $p(\gamma)$, which maps uniform $t$ to importance-sampled $\gamma$. (b) The variance of the Monte Carlo estimator $\mathrm{Var}\left[\gamma'(t)\mathcal{L}_\theta(\bm{x}_0,\bm{\epsilon},\gamma(t))\right]$ at different noise levels, estimated using 32 data samples $\bm{x}_0$ and 100 noise samples $\bm{\epsilon}$. The peak variance is achieved around $\gamma=-11.2$.
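The estimator in the Figure 3 caption, $\gamma'(t)\mathcal{L}(\gamma(t))$ with uniform $t$, can be illustrated on a toy problem. The Python sketch below compares a uniform mapping with an inverse-CDF mapping from a truncated logistic proposal. The toy loss, the interval [-10, 10], the logistic proposal and all function names are assumptions chosen for illustration; they are not the loss, the noise range or the proposal used in the paper.

```python
import math
import random

def toy_loss(gamma, center=-2.0):
    """Hypothetical stand-in for the per-noise-level loss: a bump at `center`."""
    return math.exp(-0.5 * (gamma - center) ** 2)

def uniform_gamma(t, lo=-10.0, hi=10.0):
    """Uniform proposal: gamma(t) is linear, so gamma'(t) is constant."""
    return lo + (hi - lo) * t, hi - lo

def logistic_gamma(t, lo=-10.0, hi=10.0, center=-2.0, scale=1.0):
    """Inverse CDF of a logistic proposal truncated to [lo, hi], plus gamma'(t)."""
    def cdf(g):
        return 1.0 / (1.0 + math.exp(-(g - center) / scale))
    cdf_lo, cdf_hi = cdf(lo), cdf(hi)
    u = cdf_lo + (cdf_hi - cdf_lo) * t
    gamma = center + scale * math.log(u / (1.0 - u))
    dgamma_dt = scale * (cdf_hi - cdf_lo) / (u * (1.0 - u))
    return gamma, dgamma_dt

def mc_estimate(gamma_fn, n=20000, seed=0):
    """Monte Carlo estimate of the integral of toy_loss over gamma.

    Returns the sample mean and sample variance of gamma'(t) * toy_loss(gamma(t)).
    """
    rng = random.Random(seed)
    vals = []
    for _ in range(n):
        gamma, dgamma_dt = gamma_fn(rng.random())
        vals.append(dgamma_dt * toy_loss(gamma))
    mean = sum(vals) / n
    var = sum((v - mean) ** 2 for v in vals) / (n - 1)
    return mean, var

mean_uniform, var_uniform = mc_estimate(uniform_gamma)
mean_is, var_is = mc_estimate(logistic_gamma)
```

Both mappings estimate the same integral, because multiplying by $\gamma'(t)$ corrects for the non-uniform sampling of $\gamma$. The proposal that concentrates samples where the toy loss is large should give the lower per-sample variance, which is the effect the variance-reduction technique relies on.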
Figure 4: The likelihood evaluation results under uniform dequantization for different start times $\gamma_\epsilon$. To plot the curve, we estimate the likelihood using the first 1024 test samples for CIFAR-10, and the first 512 test samples for ImageNet-32.
Figure 5: Illustration of velocity prediction and imbalance problem.
...and 6 more figures

Theorems & Definitions (11)

Theorem 3.1: Variational Bound under Truncated-Normal Dequantization
Corollary 3.2: Importance Weighted Variational Bound under Truncated-Normal Dequantization
Remark 3.3
Theorem 4.1
Remark 1.1
Remark 1.2
Theorem 2.1
proof
Lemma 6.1
proof
...and 1 more
