Fast Inference in Denoising Diffusion Models via MMD Finetuning

Emanuele Aiello; Diego Valsesia; Enrico Magli

TL;DR

This work addresses the slow sampling of Denoising Diffusion Models by finetuning a pretrained model under a fixed timestep budget, using an unbiased Maximum Mean Discrepancy (MMD) objective computed on perceptual features. The method, MMD-DDM, backpropagates through the sampling chain to minimize the MMD between real and generated data in a chosen feature space, improving fidelity with far fewer timesteps. Experiments on CIFAR-10, CelebA, ImageNet, and LSUN-Church show substantial speed-quality gains and results competitive with or superior to state-of-the-art accelerated samplers; performance depends on the selected feature space (Inception-V3 or CLIP). The approach is fast to finetune (minutes to hours), agnostic to the underlying sampling scheme, and readily combinable with existing acceleration strategies, making diffusion-based generation more practical for time-critical deployments.

Abstract

Denoising Diffusion Models (DDMs) have become a popular tool for generating high-quality samples from complex data distributions. These models are able to capture sophisticated patterns and structures in the data, and can generate samples that are highly diverse and representative of the underlying distribution. However, one of the main limitations of diffusion models is the complexity of sample generation, since a large number of inference timesteps is required to faithfully capture the data distribution. In this paper, we present MMD-DDM, a novel method for fast sampling of diffusion models. Our approach is based on the idea of using the Maximum Mean Discrepancy (MMD) to finetune the learned distribution with a given budget of timesteps. This allows the finetuned model to significantly improve the speed-quality trade-off, by substantially increasing fidelity in inference regimes with few steps or, equivalently, by reducing the required number of steps to reach a target fidelity, thus paving the way for a more practical adoption of diffusion models in a wide range of applications. We evaluate our approach on unconditional image generation with extensive experiments across the CIFAR-10, CelebA, ImageNet and LSUN-Church datasets. Our findings show that the proposed method is able to produce high-quality samples in a fraction of the time required by widely-used diffusion models, and outperforms state-of-the-art techniques for accelerated sampling. Code is available at: https://github.com/diegovalsesia/MMD-DDM.
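
To make the MMD objective more concrete, the following is a minimal sketch of an unbiased squared-MMD estimate between two batches of feature vectors. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian (RBF) kernel, the `bandwidth` parameter, and the function names `rbf_kernel` and `mmd2_unbiased` are choices made here for the example, and random vectors stand in for the Inception-V3 or CLIP features used in the paper.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between rows of a and rows of b.

    The RBF kernel is an assumption of this sketch, not a choice
    confirmed by the text above.
    """
    # Pairwise squared Euclidean distances, clipped at 0 for numerical safety.
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))


def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples x (n, d) and y (m, d)."""
    n, m = len(x), len(y)
    k_xx = rbf_kernel(x, x, bandwidth)
    k_yy = rbf_kernel(y, y, bandwidth)
    k_xy = rbf_kernel(x, y, bandwidth)
    # Excluding the diagonal (i == j) terms is what makes the estimate unbiased.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (n * (n - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (m * (m - 1))
    term_xy = 2.0 * k_xy.mean()
    return term_xx + term_yy - term_xy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(256, 4))                 # stand-in for real-image features
    fake_close = rng.normal(size=(256, 4))           # same distribution as `real`
    fake_far = rng.normal(loc=1.0, size=(256, 4))    # shifted distribution
    print("matched distributions:", mmd2_unbiased(real, fake_close, bandwidth=2.0))
    print("shifted distribution: ", mmd2_unbiased(real, fake_far, bandwidth=2.0))
```

In a finetuning setting, the second batch would hold features of samples drawn with the reduced timestep budget, and the estimate would serve as a loss whose gradient flows back through the sampling chain. That requires an autodiff framework rather than NumPy.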

Paper Structure (26 sections, 6 equations, 10 figures, 9 tables)

This paper contains 26 sections, 6 equations, 10 figures, 9 tables.

Introduction
Background and Related Work
Denoising Diffusion Models
Accelerated Sampling for DDMs
MMD in Generative Models
Method
Overview
Finetuning with MMD
Perceptually-Relevant Feature Spaces
Experiments
Setting
Datasets
Models and Sampling
Evaluation
Implementation Details
...and 11 more sections

Figures (10)

Figure 1: Generated samples for CelebA (top) and CIFAR-10 (bottom). The samples are obtained using 5 timesteps with the DDIM sampling procedure. Results from standard DDIM (left), the same model finetuned using MMD with Inception-V3 features (center-left) and CLIP features (center-right), reference images from the dataset (right). Samples are not cherry-picked. Finetuning improves the clarity and sharpness of details, occasionally introducing semantic changes.
Figure 2: Generated samples for LSUN-Church (top) and ImageNet (bottom). The samples are obtained using 5 timesteps for LSUN-Church and 10 timesteps for ImageNet, with the DDIM sampling procedure. Results from Standard DDIM (left), the same model finetuned using Inception-V3 features (center-left) and CLIP features (center-right), reference images from the dataset (right). Samples are not cherry-picked.
Figure 3: Samples generated by the DDIM model (top) and the finetuned model (bottom) for CelebA. For each generated sample, we visualize the top-4 nearest neighbours.
Figure 4: Generated samples for CIFAR-10. The samples are obtained using 5, 10, 15 and 20 timesteps with the DDIM sampling procedure. Results from Standard DDIM (left), the same model finetuned using Inception-V3 features (center) and CLIP features (right).
Figure 5: Generated samples for CelebA. The samples are obtained using 5, 10, 15 and 20 timesteps with the DDIM sampling procedure. Results from Standard DDIM (left), the same model finetuned using Inception-V3 features (center) and CLIP features (right).
...and 5 more figures
