Improving Diffusion-Based Generative Models via Approximated Optimal Transport

Daegyu Kim; Jooyoung Choi; Chaehun Shin; Uiwon Hwang; Sungroh Yoon

TL;DR

This work tackles the high curvature and truncation errors that limit diffusion-based image synthesis by introducing Approximated Optimal Transport (AOT), a training scheme that approximates optimal transport by solving an assignment problem with the Hungarian algorithm to pair images with informative noise. By reducing the information entropy of the training targets, AOT yields straighter, lower-curvature ODE trajectories and enables high-quality generation with fewer function evaluations (NFEs). On CIFAR-10, it reaches $\text{FID}=1.88$ at 27 NFEs (unconditional) and $1.73$ at 29 NFEs (conditional), with further gains to $1.68$ and $1.58$ under Discriminator Guidance (DG). For DG, the discriminator is trained on AOT-synthesized pairs, and both guided results use 29 NFEs; the paper reports these as state-of-the-art FID scores. Overall, AOT offers a training-centered path to reduce sampling costs while maintaining or improving image quality, with configurable GPU-memory strategies and potential for extension to conditional guidance beyond images.
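The pairing step mentioned above can be illustrated with a small sketch. It is not the authors' implementation: the squared-L2 cost between an image and a noise vector, the toy 4-dimensional "images", and the helper names `hungarian`, `squared_l2`, and `pair_images_with_noise` are all assumptions chosen for illustration. The solver is a standard O(n^3) potential-based form of the Hungarian algorithm, written in pure Python so the example is self-contained.

```python
import random

def hungarian(cost):
    """Minimum-cost assignment for a square cost matrix.

    Returns a list where assignment[row] is the column matched to that row.
    """
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (n + 1)   # column potentials (1-indexed)
    p = [0] * (n + 1)     # p[j] = row currently matched to column j (0 = none)
    way = [0] * (n + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = inf
            j1 = 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            # Shift potentials so that one more column becomes reachable.
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path.
        while j0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    return assignment

def squared_l2(a, b):
    """Squared Euclidean distance (assumed cost; the paper's cost may differ)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pair_images_with_noise(images, noises):
    """Reorder a batch of noise vectors so the total image-noise cost is minimal."""
    cost = [[squared_l2(img, eps) for eps in noises] for img in images]
    assignment = hungarian(cost)
    return [noises[j] for j in assignment], assignment

# Toy batch: 6 hypothetical "images" and 6 Gaussian noise vectors of dimension 4.
rng = random.Random(0)
images = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
noises = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]

matched, assignment = pair_images_with_noise(images, noises)
random_cost = sum(squared_l2(x, e) for x, e in zip(images, noises))
matched_cost = sum(squared_l2(x, e) for x, e in zip(images, matched))
print("assignment:", assignment)
print("cost with random pairing:", random_cost)
print("cost with assigned pairing:", matched_cost)
```

The assigned pairing can never cost more than the random one, since the identity pairing is itself one of the candidate assignments. Solving the problem per mini-batch rather than over the whole dataset is what makes this an approximation of optimal transport.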

Abstract

We introduce the Approximated Optimal Transport (AOT) technique, a novel training scheme for diffusion-based generative models. Our approach aims to approximate and integrate optimal transport into the training process, significantly enhancing the ability of diffusion models to estimate the denoiser outputs accurately. This improvement leads to ODE trajectories of diffusion models with lower curvature and reduced truncation errors during sampling. We achieve superior image quality and reduced sampling steps by employing AOT in training. Specifically, we achieve FID scores of 1.88 with just 27 NFEs and 1.73 with 29 NFEs in unconditional and conditional generations, respectively. Furthermore, when applying AOT to train the discriminator for guidance, we establish new state-of-the-art FID scores of 1.68 and 1.58 for unconditional and conditional generations, respectively, each with 29 NFEs. This outcome demonstrates the effectiveness of AOT in enhancing the performance of diffusion models.

Paper Structure

This paper contains 28 sections, 7 equations, 11 figures, and 5 tables.

Introduction
Preliminaries
Diffusion and Score Models
EDM
Model Training
ODE Curvature Scheduling
Efficient Sampling
Optimal Transport for ODE-Based Models
Hungarian Algorithm
Assignment Problem
Hungarian Algorithm
Motivation
Approximated Optimal Transport Training
Training Process using AOT
AOT Implementations
...and 13 more sections

Figures (11)

Figure 1: Comparison of FID scores and the corresponding number of function evaluations (NFEs) for unconditional and conditional CIFAR-10 image generation in baseline studies and EDM-AOT. The graph shows that our approach achieves better image quality (lower FID) with fewer NFEs than the baselines.
Figure 2: The images are generated using the unconditional EDM model with a single Euler step at each noisy image during the sampling process, proceeding from left to right. The images exhibit consistency, especially noticeable at low noise levels.
Figure 3: Synthesized CIFAR-10 images in an unconditional generation. (a) Denoised images $\mathbf{x}_0$ with EDM and its 35 NFEs sampler. (b) Denoised images $\mathbf{x}_0$ using the EDM model and single-step Euler method. (c) Denoised images $\mathbf{x}_0$ using the EDM-AOT model and single-step Euler method. The denoised images synthesized by EDM-AOT exhibit greater diversity than those synthesized by EDM.
Figure 4: FID scores of CIFAR-10 images across varying $\rho$ values and model configurations. The models employing AOT exhibited consistent performance even as $\rho$ increased. (a) FID scores in unconditional generation. (b) FID scores in conditional generation. (c) FID scores in unconditional generation, with 8 steps (15 NFEs).
Figure 5: Variations in FID scores of CIFAR-10 images across different step counts. We sample images using four different $\rho$ values: 9, 27, 81, and 243, and select the optimal configuration. The dots on the graph represent points with minimal steps among those achieving the lowest FID score. (a) FID scores in unconditional generation. (b) FID scores in conditional generation.
...and 6 more figures
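The captions above refer to the EDM parameter $\rho$ and to step counts such as "8 steps (15 NFEs)". As a rough aid, the sketch below reproduces the noise-level schedule from the EDM paper, $\sigma_i = \big(\sigma_{\max}^{1/\rho} + \tfrac{i}{N-1}(\sigma_{\min}^{1/\rho} - \sigma_{\max}^{1/\rho})\big)^{\rho}$, and the NFE count of a second-order Heun sampler. The defaults `sigma_min=0.002` and `sigma_max=80.0` are EDM's published values and are assumed, not confirmed by this page, to apply to EDM-AOT as well; the function names are hypothetical.

```python
def edm_sigma_schedule(num_steps, rho=7.0, sigma_min=0.002, sigma_max=80.0):
    """Noise levels sigma_0 > ... > sigma_{N-1}, followed by a final 0.

    Follows the EDM time-step formula; larger rho takes a bigger first
    step away from sigma_max and spends more steps at low noise levels.
    """
    inv_rho = 1.0 / rho
    hi = sigma_max ** inv_rho
    lo = sigma_min ** inv_rho
    sigmas = [
        (hi + i / (num_steps - 1) * (lo - hi)) ** rho
        for i in range(num_steps)
    ]
    return sigmas + [0.0]

def heun_nfe(num_steps):
    """Function evaluations of a Heun sampler (assumed sampler).

    Each step costs two evaluations except the last one, which lands on
    sigma = 0 and falls back to a single Euler evaluation.
    """
    return 2 * num_steps - 1

# Step counts consistent with the NFE figures quoted on this page.
for steps in (8, 14, 15, 18):
    print(steps, "steps ->", heun_nfe(steps), "NFEs")

# Second noise level for the rho values listed in the Figure 5 caption.
for rho in (9, 27, 81, 243):
    schedule = edm_sigma_schedule(8, rho=rho)
    print("rho =", rho, "sigma_1 =", schedule[1])
```

Under this assumption, the 27 and 29 NFE results correspond to 14 and 15 sampling steps, and the 35 NFE sampler in Figure 3 corresponds to 18 steps.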
