Fast Inference in Denoising Diffusion Models via MMD Finetuning
Emanuele Aiello, Diego Valsesia, Enrico Magli
TL;DR
This work addresses the slow sampling of Denoising Diffusion Models by finetuning a pretrained model under a fixed budget of timesteps, using an unbiased Maximum Mean Discrepancy (MMD) objective applied to perceptual features. The method, MMD-DDM, backpropagates through the sampling chain to minimize the MMD between real and generated data in a chosen feature space, improving fidelity when only a few timesteps are used. Experiments on CIFAR-10, CelebA, ImageNet, and LSUN-Church show substantial gains in the speed-quality trade-off and results competitive with or superior to state-of-the-art accelerated samplers, with performance depending on the selected feature space (Inception-V3 or CLIP). The approach is fast to finetune (minutes to hours), agnostic to the underlying sampling scheme, and readily combined with existing acceleration strategies, making diffusion-based generation more practical for time-critical deployments.
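As an illustration of the unbiased MMD objective mentioned above, the following is a minimal NumPy sketch of the standard unbiased estimator of the squared MMD between two sets of feature vectors. It is not taken from the authors' code: the Gaussian kernel, the `bandwidth` parameter, and the function names `gaussian_kernel` and `mmd2_unbiased` are assumptions made for this example, and random arrays stand in for the Inception-V3 or CLIP features of real and generated images.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b.

    The kernel choice is an assumption of this sketch.
    """
    # Pairwise squared Euclidean distances via the expansion |a|^2 + |b|^2 - 2 a.b
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # clip tiny negatives caused by round-off
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of the squared MMD between samples x (n, d) and y (m, d)."""
    n, m = x.shape[0], y.shape[0]
    if n < 2 or m < 2:
        raise ValueError("the unbiased estimator needs at least two samples per set")
    k_xx = gaussian_kernel(x, x, bandwidth)
    k_yy = gaussian_kernel(y, y, bandwidth)
    k_xy = gaussian_kernel(x, y, bandwidth)
    # Within-set terms exclude the diagonal (i == j), which is what removes the bias.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (n * (n - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (m * (m - 1))
    term_xy = k_xy.mean()
    return term_xx + term_yy - 2.0 * term_xy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real_feats = rng.normal(size=(200, 8))            # stand-in for real-image features
    close_feats = rng.normal(size=(200, 8))           # same distribution as real_feats
    far_feats = rng.normal(loc=2.0, size=(200, 8))    # shifted distribution
    print("same distribution:   ", mmd2_unbiased(real_feats, close_feats, bandwidth=3.0))
    print("shifted distribution:", mmd2_unbiased(real_feats, far_feats, bandwidth=3.0))
```

Because the estimator is unbiased, it can be slightly negative when the two sample sets come from the same distribution. For finetuning, the same quantity would have to be computed in an automatic-differentiation framework so that gradients can flow from the generated features back through the sampling chain; the NumPy version only illustrates the estimator itself.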
Abstract
Denoising Diffusion Models (DDMs) have become a popular tool for generating high-quality samples from complex data distributions. These models are able to capture sophisticated patterns and structures in the data, and can generate samples that are highly diverse and representative of the underlying distribution. However, one of the main limitations of diffusion models is the complexity of sample generation, since a large number of inference timesteps is required to faithfully capture the data distribution. In this paper, we present MMD-DDM, a novel method for fast sampling of diffusion models. Our approach is based on the idea of using the Maximum Mean Discrepancy (MMD) to finetune the learned distribution with a given budget of timesteps. This allows the finetuned model to significantly improve the speed-quality trade-off, by substantially increasing fidelity in inference regimes with few steps or, equivalently, by reducing the required number of steps to reach a target fidelity, thus paving the way for a more practical adoption of diffusion models in a wide range of applications. We evaluate our approach on unconditional image generation with extensive experiments across the CIFAR-10, CelebA, ImageNet and LSUN-Church datasets. Our findings show that the proposed method is able to produce high-quality samples in a fraction of the time required by widely-used diffusion models, and outperforms state-of-the-art techniques for accelerated sampling. Code is available at: https://github.com/diegovalsesia/MMD-DDM.
