Diffusion Models are Minimax Optimal Distribution Estimators

Kazusato Oko, Shunta Akiyama, Taiji Suzuki

TL;DR

This work establishes a statistical learning theory for diffusion-based distribution estimation. It proves nearly minimax optimal estimation rates in total variation (TV) distance and in Wasserstein distance of order one ($W_1$) when the true density lies in a Besov space and the score is learned by a neural network. The authors introduce a diffused B-spline basis to approximate the score and convert approximation error into estimation error, yielding explicit rates and network-size bounds. They further show that diffusion models adapt to intrinsic dimensionality, avoiding the curse of dimensionality under a manifold assumption, and they propose a score-network switching scheme to tighten the Wasserstein-rate bounds. Overall, the paper provides rigorous guarantees for diffusion models as distribution estimators in high-dimensional settings.

Abstract

While efficient distribution learning is no doubt behind the groundbreaking success of diffusion modeling, its theoretical guarantees are quite limited. In this paper, we provide the first rigorous analysis of the approximation and generalization abilities of diffusion modeling for well-known function spaces. The highlight of this paper is that when the true density function belongs to the Besov space and the empirical score matching loss is properly minimized, the generated data distribution achieves the nearly minimax optimal estimation rates in the total variation distance and in the Wasserstein distance of order one. Furthermore, we extend our theory to demonstrate how diffusion models adapt to low-dimensional data distributions. We expect these results to advance the theoretical understanding of diffusion modeling and its ability to generate verisimilar outputs.
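As a rough illustration of what "minimizing the empirical score matching loss" means, the sketch below estimates a denoising score matching loss by Monte Carlo in one dimension. It rests on several assumptions that are not stated in this summary: the forward process is taken to be an Ornstein-Uhlenbeck process with unit noise schedule, so that $X_t = m_t X_0 + \sigma_t Z$ with $m_t = e^{-t}$ and $\sigma_t^2 = 1 - e^{-2t}$; the data are standard Gaussian, for which the true score is $-x$ at every $t$; and the candidate scores are plain functions rather than neural networks. The helper name `dsm_loss` is hypothetical.

```python
import math
import random


def dsm_loss(score, samples, t, rng):
    """Monte Carlo estimate of a 1-D denoising score matching loss at time t.

    Assumes an Ornstein-Uhlenbeck forward process with unit noise schedule:
    X_t = m_t * X_0 + sigma_t * Z, with Z ~ N(0, 1).
    """
    m_t = math.exp(-t)                              # mean scaling
    sigma_t = math.sqrt(1.0 - math.exp(-2.0 * t))   # noise standard deviation
    total = 0.0
    for x0 in samples:
        z = rng.gauss(0.0, 1.0)
        x_t = m_t * x0 + sigma_t * z
        # Conditional score of p_t(x_t | x0): -(x_t - m_t * x0) / sigma_t^2
        target = -z / sigma_t
        total += (score(x_t, t) - target) ** 2
    return total / len(samples)


rng = random.Random(0)
# Toy data: standard Gaussian, whose diffused marginals stay N(0, 1).
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]

loss_true = dsm_loss(lambda x, t: -x, samples, 1.0, rng)   # true score
loss_zero = dsm_loss(lambda x, t: 0.0, samples, 1.0, rng)  # trivial score
```

Under these assumptions the true score attains the smaller loss, and the gap between the two losses estimates the expected squared norm of the true score, which is 1 for this toy example.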

Paper Structure

This paper contains 80 sections, 101 theorems, and 591 equations.

Key Result

Theorem 3.1

There exists a neural network $\phi_{\mathrm{score}} \in \Phi(L,W,S,B)$ that satisfies, for all $t\in [\underline{T},\overline{T}]$, [...] Here, $L$, $W$, $S$, and $B$ are evaluated as $L = \mathcal{O}(\log^4 N)$, $\|W\|_\infty = \mathcal{O}(N\log^6 N)$, $S = \mathcal{O}(N\log^8 N)$, and $B = \exp(\mathcal{O}(\log^4 N))$. Moreover, we can take $\phi_{\mathrm{score}}$ satisfying $\|\phi_{\mathrm{score}}(\cdot,t)\|_\infty =$ [...]
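To get a feel for how the architecture bounds in Theorem 3.1 grow with $N$, the following sketch evaluates the leading terms with every hidden constant in the $\mathcal{O}(\cdot)$ notation set to 1. That choice is an assumption made purely for illustration, so the outputs show orders of growth, not actual network sizes. The function name `score_network_size_orders` is hypothetical and does not come from the paper.

```python
import math


def score_network_size_orders(N: int) -> dict:
    """Leading-order growth terms from Theorem 3.1.

    All constants hidden by the O(.) notation are set to 1 (an assumption).
    The norm bound B = exp(O(log^4 N)) is reported through its logarithm
    to avoid floating-point overflow.
    """
    if N < 2:
        raise ValueError("N must be at least 2 so that log N is positive")
    log_n = math.log(N)
    return {
        "depth_L": log_n ** 4,         # L = O(log^4 N)
        "width_W": N * log_n ** 6,     # ||W||_inf = O(N log^6 N)
        "sparsity_S": N * log_n ** 8,  # S = O(N log^8 N)
        "log_B": log_n ** 4,           # log B = O(log^4 N)
    }


orders = score_network_size_orders(1000)
```

With unit constants, width and sparsity grow linearly in $N$ up to polylogarithmic factors, while depth and $\log B$ grow only polylogarithmically.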

Theorems & Definitions (189)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3: Besov space $B_{p,q}^s(\Omega)$
  • Theorem 3.1
  • Lemma 3.2: Informal version of \ref{Lemma:SuzukiBesov}; suzuki2018adaptivity
  • Lemma 3.3: See also \ref{Lemma:MandSigma}
  • Lemma 3.4
  • Lemma 3.5
  • Lemma 3.6
  • Lemma 4.1
  • ...and 179 more