One-step Diffusion Models with $f$-Divergence Distribution Matching

Yilun Xu, Weili Nie, Arash Vahdat

TL;DR

This work tackles the inefficiency of diffusion model sampling by introducing $f$-distill, a general framework for distilling a teacher diffusion model into a one-step student by minimizing a chosen $f$-divergence between the teacher and student distributions. It derives a gradient for $D_f(p_t \| q_t)$ that is the teacher-student score difference weighted by a factor depending on the density ratio, which unifies and extends variational score distillation. To stabilize training, the framework adds a practical two-stage normalization and GAN-based density-ratio estimation. Experiments show that less mode-seeking divergences, especially Jensen-Shannon, yield state-of-the-art one-step generation on ImageNet-64 and strong zero-shot MS-COCO results, and that the approach scales to larger models such as SDXL. Overall, $f$-distill provides a flexible, principled approach to distribution matching in diffusion distillation, enabling faster image synthesis while retaining high fidelity.

Abstract

Sampling from diffusion models involves a slow iterative process that hinders their practical deployment, especially for interactive applications. To accelerate generation, recent approaches distill a multi-step diffusion model into a single-step student generator via variational score distillation, which matches the distribution of samples generated by the student to the teacher's distribution. However, these approaches use the reverse Kullback-Leibler (KL) divergence for distribution matching, which is known to be mode seeking. In this paper, we generalize the distribution matching approach using a novel $f$-divergence minimization framework, termed $f$-distill, that covers different divergences with different trade-offs in terms of mode coverage and training variance. We derive the gradient of the $f$-divergence between the teacher and student distributions and show that it is expressed as the product of their score differences and a weighting function determined by their density ratio. This weighting function naturally emphasizes samples with higher density in the teacher distribution when using a less mode-seeking divergence. We observe that the popular variational score distillation approach using the reverse-KL divergence is a special case within our framework. Empirically, we demonstrate that alternative $f$-divergences, such as forward-KL and Jensen-Shannon divergences, outperform the current best variational score distillation methods across image generation tasks. In particular, when using Jensen-Shannon divergence, $f$-distill achieves current state-of-the-art one-step generation performance on ImageNet64 and zero-shot text-to-image generation on MS-COCO. Project page: https://research.nvidia.com/labs/genair/f-distill
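
To make the role of the weighting function concrete, the sketch below assumes it takes the form $h(r) = f''(r)\,r^2$, where $r = p/q$ is the teacher-to-student density ratio. That form is not stated in the abstract and is an assumption here, as are the function names and the particular generator convention used for Jensen-Shannon. Under this assumption, reverse-KL gives a constant weight, consistent with it being the unweighted special case, while forward-KL and Jensen-Shannon give weights that grow with the teacher's relative density.

```python
import math

# ASSUMPTION: the weighting function has the form h(r) = f''(r) * r**2,
# with r = p(x) / q(x) (teacher density over student density).
# All names below are illustrative, not taken from the paper's code.

def f_reverse_kl(r: float) -> float:
    return -math.log(r)

def f_forward_kl(r: float) -> float:
    return r * math.log(r)

def f_jensen_shannon(r: float) -> float:
    # One common generator convention for Jensen-Shannon (assumed here).
    return r * math.log(r) - (r + 1.0) * math.log((r + 1.0) / 2.0)

def h_reverse_kl(r: float) -> float:
    # f''(r) = 1 / r^2, so h(r) = 1: every sample is weighted equally.
    return 1.0

def h_forward_kl(r: float) -> float:
    # f''(r) = 1 / r, so h(r) = r: weight grows linearly with the ratio.
    return r

def h_jensen_shannon(r: float) -> float:
    # f''(r) = 1 / (r (r + 1)), so h(r) = r / (r + 1): bounded in (0, 1).
    return r / (r + 1.0)

def h_numeric(f, r: float, eps: float = 1e-3) -> float:
    # Central finite difference for f''(r), then multiply by r^2.
    second = (f(r + eps) - 2.0 * f(r) + f(r - eps)) / (eps * eps)
    return second * r * r

if __name__ == "__main__":
    for r in (0.1, 1.0, 10.0):
        print(r, h_reverse_kl(r), h_forward_kl(r), h_jensen_shannon(r))
```

The finite-difference helper `h_numeric` is only a sanity check that each closed form matches its generator $f$. Note that the Jensen-Shannon weight saturates below 1, whereas the forward-KL weight is unbounded in $r$, which is one way to read the trade-off between mode coverage and training variance mentioned above.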

Paper Structure

This paper contains 35 sections, 5 theorems, 16 equations, 19 figures, 10 tables, 1 algorithm.

Key Result

Theorem 1

Let $p$ be the teacher's generative distribution, and let $q$ be a distribution induced by transforming a prior distribution $p({\mathbf{z}})$ through the differentiable mapping $G_\theta$. Assuming $f$ is twice continuously differentiable, then the gradient of $f$-divergence between the two interme where ${\mathbf{z}} \sim p({\mathbf{z}}), \epsilon \sim \mathcal{N}( \mathbf{0}, {\bm{I}})$ and ${

Figures (19)

  • Figure 1: Uncurated samples generated by (a) the 50-step teacher (CFG=8) and (b) the one-step student in $f$-distill, using the same set of prompts on SDXL.
  • Figure 2: The gradient update in $f$-distill is a product of the difference between the teacher and fake scores and a weighting function determined by the chosen $f$-divergence and density ratio. The density ratio is readily available in the auxiliary GAN objective.
  • Figure 3: Score difference and the weighting function on a 2D example. $h$ is the weighting function in forward-KL. Observe that the teacher and fake scores often diverge in lower-density regions (darker colors in the bottom left figure indicate larger score differences), where larger estimation errors occur. The weighting function downweights these regions (lighter colors in the bottom right figure) during gradient updates for $f$-distill.
  • Figure 4: (a) The absolute value of $f'$ and (b) the weighting function $h(r)$ in different $f$-divergences.
  • Figure 5: (a) Normalized variance versus the mean difference between two Gaussians. (b) Training losses of forward-KL w/ and w/o normalizations.
  • ...and 14 more figures
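
The caption of Figure 2 notes that the density ratio is readily available from the auxiliary GAN objective. A minimal sketch of why, assuming a discriminator trained with the standard logistic (binary cross-entropy) GAN loss to separate teacher samples from student samples: at its optimum the discriminator outputs $D(x) = p(x)/(p(x)+q(x))$, so the ratio is $D/(1-D)$, which equals the exponential of the pre-sigmoid logit. The 1D Gaussians and all function names below are hypothetical stand-ins for a trained discriminator, not the paper's setup.

```python
import math

def gaussian_logpdf(x: float, mean: float, std: float = 1.0) -> float:
    # Log density of N(mean, std^2) at x.
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))

def optimal_logit(x: float) -> float:
    # Toy stand-in for a trained discriminator's pre-sigmoid output.
    # ASSUMPTION: teacher p = N(0, 1), student q = N(1, 1); for the standard
    # logistic GAN loss the optimal logit is log p(x) - log q(x).
    return gaussian_logpdf(x, 0.0) - gaussian_logpdf(x, 1.0)

def sigmoid(u: float) -> float:
    return 1.0 / (1.0 + math.exp(-u))

def ratio_from_logit(logit: float) -> float:
    # D = sigmoid(logit) = p / (p + q)  implies  p / q = D / (1 - D) = exp(logit).
    return math.exp(logit)

if __name__ == "__main__":
    for x in (-1.0, 0.3, 2.0):
        logit = optimal_logit(x)
        d = sigmoid(logit)
        # The two printed ratio estimates agree up to floating-point error.
        print(x, d / (1.0 - d), ratio_from_logit(logit))
```

In practice the discriminator is only approximately optimal, so the ratio read off its logit is an estimate; this is presumably where the stabilizing normalization mentioned in the TL;DR becomes relevant.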

Theorems & Definitions (8)

  • Theorem 1
  • Proposition 1
  • Lemma 1
  • proof
  • Theorem 1
  • proof
  • Proposition 1
  • proof