Algorithm- and Data-Dependent Generalization Bounds for Diffusion Models

Benjamin Dupuis, Dario Shariatian, Maxime Haddouche, Alain Durmus, Umut Simsekli

TL;DR

The paper tackles diffusion-model generalization by deriving algorithm- and data-dependent bounds that explicitly incorporate optimization dynamics. It introduces a score-estimation decomposition $\varepsilon_{\mathrm{s}}^{(n)}(\theta) = \mathscr{L}_{\mathrm{ESM}}^{(n)}(\theta) + \Delta_{\mathrm{s}}^{(n)} + \mathscr{G}_{\mathrm{l}}^{(n)}(\theta)$ and shows how these components bound the KL divergence between the true data distribution and the generated distribution, yielding an overall rate of $\mathcal{O}(n^{-1/2})$ under realistic conditions. The work further links generalization to gradient norms and optimization trajectories through SGLD and topological bounds for ADAM, and supports the theory with extensive experiments on low- and high-dimensional data, including image datasets. This framework sheds light on why SGMs generalize well in practice and provides actionable metrics to study and predict diffusion-model performance.

Abstract

Score-based generative models (SGMs) have emerged as one of the most popular classes of generative models. A substantial body of work now exists on the analysis of SGMs, focusing either on discretization aspects or on their statistical performance. In the latter case, bounds have been derived, under various metrics, between the true data distribution and the distribution induced by the SGM, often demonstrating polynomial convergence rates with respect to the number of training samples. However, these approaches adopt a largely approximation theory viewpoint, which tends to be overly pessimistic and relatively coarse. In particular, they fail to fully explain the empirical success of SGMs or capture the role of the optimization algorithm used in practice to train the score network. To support this observation, we first present simple experiments illustrating the concrete impact of optimization hyperparameters on the generalization ability of the generated distribution. Then, this paper aims to bridge this theoretical gap by providing the first algorithmic- and data-dependent generalization analysis for SGMs. In particular, we establish bounds that explicitly account for the optimization dynamics of the learning algorithm, offering new insights into the generalization behavior of SGMs. Our theoretical findings are supported by empirical results on several datasets.

Paper Structure

This paper contains 32 sections, 15 theorems, 97 equations, 6 figures, 2 tables.

Key Result

Theorem 2.1

Under Assumption ass:fisher-constant-step-size, for any $h > 0$ and $N \in \mathbb{N}$ such that $T = hN$, it holds [...] where $\varepsilon_{\mathrm{s}}^{(n)}(\theta) := T^{-1} \sum_{k=0}^{N-1} h \, \mathbb{E}\left[ \Vert s_{\theta}(T - t_k, \overrightarrow{X}_{T - t_k}) - 2 \nabla \log \Tilde{p}_{T - t_k}(\overrightarrow{X}_{T - t_k}) \Vert^2 \right].$
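The definition of $\varepsilon_{\mathrm{s}}^{(n)}(\theta)$ above is a step-size-weighted average of squared score errors over the time grid, and it can be estimated by Monte Carlo. The following is a minimal sketch, not the authors' implementation. It assumes a uniform grid $t_k = kh$ (the source only states $T = hN$), and the names `discretized_score_error`, `score_fn`, `target_fn`, and `forward_samples` are hypothetical. The toy `target` function is a stand-in for $2\nabla\log\Tilde{p}_t$, not the true score of any particular forward process.

```python
import numpy as np

def discretized_score_error(score_fn, target_fn, forward_samples, h):
    """Monte Carlo estimate of (1/T) * sum_k h * E||s(T - t_k, X) - target(T - t_k, X)||^2.

    forward_samples[k] is an (m, d) array of samples of the forward process at
    time T - t_k, with t_k = k * h assumed (uniform grid) and T = h * N.
    """
    N = len(forward_samples)
    T = h * N
    total = 0.0
    for k, x in enumerate(forward_samples):
        t = T - k * h
        diff = score_fn(t, x) - target_fn(t, x)
        # Average the squared Euclidean norm over the m samples.
        total += h * np.mean(np.sum(diff ** 2, axis=1))
    return total / T

# Toy check with placeholder samples and a hypothetical target function.
rng = np.random.default_rng(0)
h, N, d = 0.1, 20, 2
samples = [rng.standard_normal((500, d)) for _ in range(N)]

target = lambda t, x: -np.exp(-t) * x          # stand-in for 2 * grad log p~_t
offset = np.array([0.3, -0.4])                 # constant estimation error
perturbed = lambda t, x: target(t, x) + offset

err_exact = discretized_score_error(target, target, samples, h)
err_shift = discretized_score_error(perturbed, target, samples, h)
print(err_exact, err_shift)
```

With a perfect score the estimate is zero, and with a constant offset it equals the squared norm of the offset ($0.3^2 + 0.4^2 = 0.25$, up to floating-point error), because the weights $h/T$ sum to one over the $N$ grid points.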

Figures (6)

  • Figure 1: Experiments with varying learning rates and batch sizes obtained with the ADAM optimizer. (Left) Test Wasserstein-2 $\downarrow$ metric on a Gaussian mixture dataset. (Middle) FID $\downarrow$ on MNIST. (Right) FID $\downarrow$ on the butterflies dataset (Wang09). See \ref{sec:experimental-setup} for full experimental details.
  • Figure 2: SGLD optimizer (\ref{eq:sgld-definition}) on a low-dimensional Gaussian mixture dataset, for different values of the temperature ($1/\beta$). We use full batch size, a constant learning rate $\eta$, a grid of values of $(n,\eta)$, and $10$ random seeds. $x$-axis: value of $\sqrt{ \eta \beta \langle \Vert \widehat{g}_k \Vert^2 \rangle / n}$. $y$-axis: score generalization gap.
  • Figure 3: ADAM optimizer on the butterflies dataset (left) and the flowers dataset (right). Generalization gap vs. several complexity metrics: $b\langle\Vert\widehat{g}_k\Vert^2\rangle$ (top left), $E^1({\mathcal{W}}^{(n)})$ (top right), $\mathrm{PMag}(10^{-2}\cdot {\mathcal{W}}^{(n)})$ (bottom left), and $\mathrm{PMag}(\sqrt{n} \cdot {\mathcal{W}}^{(n)})$ (bottom right).
  • Figure 4: Butterflies samples after 200k training steps, 500 sampling steps.
  • Figure 5: ADAM optimizer on the MNIST dataset. Score generalization gap vs. several complexity metrics: $b\langle\Vert\widehat{g}_k\Vert^2\rangle$ (top left), $E^1({\mathcal{W}}^{(n)})$ (top right), $\mathrm{PMag}(10^{-2}\cdot {\mathcal{W}}^{(n)})$ (bottom left), and $\mathrm{PMag}(\sqrt{n} \cdot {\mathcal{W}}^{(n)})$ (bottom right).
  • ...and 1 more figure
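The $x$-axis quantity in the Figure 2 caption, $\sqrt{\eta\beta\langle\Vert\widehat{g}_k\Vert^2\rangle/n}$, can be tracked during a full-batch SGLD run. The sketch below is illustrative only. The paper's SGLD definition is not reproduced in this summary, so the code assumes the standard update $\theta_{k+1} = \theta_k - \eta\,\widehat{g}_k + \sqrt{2\eta/\beta}\,\xi_k$ with $\xi_k$ standard Gaussian, and it reads $\langle\cdot\rangle$ as an average over iterations. The function names and the quadratic toy loss are hypothetical and are not the score-matching objective used in the paper.

```python
import numpy as np

def sgld_run(grad_fn, theta0, eta, beta, n_steps, rng):
    """Full-batch SGLD (standard form, assumed): gradient step plus Gaussian noise.

    Returns the final iterate and the average of ||g_k||^2 over iterations.
    """
    theta = np.array(theta0, dtype=float)
    sq_norms = []
    for _ in range(n_steps):
        g = grad_fn(theta)
        sq_norms.append(float(g @ g))
        noise = rng.standard_normal(theta.shape)
        theta = theta - eta * g + np.sqrt(2.0 * eta / beta) * noise
    return theta, float(np.mean(sq_norms))

def complexity_metric(eta, beta, mean_sq_grad_norm, n):
    """sqrt(eta * beta * <||g_k||^2> / n), the x-axis quantity of Figure 2."""
    return float(np.sqrt(eta * beta * mean_sq_grad_norm / n))

# Toy problem: loss(theta) = (1/n) * sum_i 0.5 * ||theta - x_i||^2,
# whose full-batch gradient is theta - mean(x).
rng = np.random.default_rng(0)
n = 200
data = rng.normal(loc=[1.0, -2.0], size=(n, 2))
grad_fn = lambda theta: theta - data.mean(axis=0)

eta, beta = 0.1, 1e3
theta_final, mean_sq = sgld_run(grad_fn, np.zeros(2), eta, beta, 200, rng)
print(theta_final, complexity_metric(eta, beta, mean_sq, n))
```

In this toy setting a larger $\beta$ (lower temperature) reduces the injected noise, so the iterates settle closer to the empirical mean, while the metric itself scales as $\sqrt{\beta}$ for a fixed average gradient norm.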

Theorems & Definitions (31)

  • Theorem 2.1
  • Theorem 3.1
  • Lemma 3.1
  • Lemma 3.2
  • Proposition 3.1
  • Theorem 4.1: mou2018generalization
  • Theorem 4.2: andreeva2024topological
  • Lemma A.1
  • proof
  • Lemma A.2
  • ...and 21 more