Provable Statistical Rates for Consistency Diffusion Models

Zehao Dou, Minshuo Chen, Mengdi Wang, Zhuoran Yang

TL;DR

This work provides the first statistical theory for consistency diffusion models, treating their training as a Wasserstein-1 discrepancy minimization problem that unifies distillation and isolation training. By connecting consistency models to a baseline DDPM solver and carefully decomposing the sources of error (score estimation, discretization, and empirical-population gaps), the authors derive nonparametric sample-complexity bounds that match those of vanilla diffusion models. Under mild tail and Lipschitz assumptions, the distillation approach achieves a nearly optimal rate of $\widetilde{\mathcal{O}}(n^{-1/(2(d+5))})$, while the isolation approach yields $\widetilde{\mathcal{O}}(n^{-1/d})$ under a bounded-support assumption; both confirm that speedups in sampling do not substantially compromise distributional accuracy. The results provide a principled foundation for deploying fast consistency diffusion models in practice and offer guidance on score-estimation quality, discretization granularity, and dataset size.
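
At a high level, the analysis controls the Wasserstein-1 distance between the distribution generated by the consistency model and the data distribution by summing the three error sources above. Schematically (our shorthand, not the paper's exact statement),

$$W_1\big(\widehat{P}_{\mathrm{gen}},\, P_{\mathrm{data}}\big) \;\lesssim\; \underbrace{\varepsilon_{\mathrm{score}}}_{\text{score estimation}} \;+\; \underbrace{\varepsilon_{\mathrm{disc}}}_{\text{discretization}} \;+\; \underbrace{\varepsilon_{\mathrm{stat}}}_{\text{empirical-population gap}},$$

where $\widehat{P}_{\mathrm{gen}}$ denotes the law of the consistency model's one-step output.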

Abstract

Diffusion models have revolutionized various application domains, including computer vision and audio generation. Despite the state-of-the-art performance, diffusion models are known for their slow sample generation due to the extensive number of steps involved. In response, consistency models have been developed to merge multiple steps in the sampling process, thereby significantly boosting the speed of sample generation without compromising quality. This paper contributes towards the first statistical theory for consistency models, formulating their training as a distribution discrepancy minimization problem. Our analysis yields statistical estimation rates based on the Wasserstein distance for consistency models, matching those of vanilla diffusion models. Additionally, our results encompass the training of consistency models through both distillation and isolation methods, demystifying their underlying advantage.

Paper Structure

This paper contains 31 sections, 21 theorems, 143 equations, and 2 figures.

Key Result

Lemma 3.1

The approximator above exactly equals the score function of the distribution $\mathcal{X}_t$, i.e., it coincides with $\nabla_{\mathbf{x}} \log \widehat{p}_t(\mathbf{x})$. Here, $\widehat{p}_t(\cdot)$ is the density of $\mathcal{X}_t = m(t)\widehat{p}_{\mathrm{data}}\star \mathcal{N}(0,\sigma(t)^2\bm I)$, which is a mixture of Gaussians. Therefore, it has an explicit closed-form expression and needs no additional training.
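
Because $\widehat{p}_t$ is a finite Gaussian mixture centered at the scaled training points, its score can be evaluated exactly as a softmax-weighted average over the data. The sketch below is a minimal NumPy illustration of this closed-form score under our own naming (the schedules m and sigma are placeholder choices, not the paper's):

```python
# Minimal sketch (not the paper's code): exact score of the Gaussian mixture
#   p_hat_t = (1/n) * sum_i N(m(t) * x_i, sigma(t)^2 I),
# i.e. the empirical data distribution scaled by m(t) and convolved with noise.
import numpy as np

def empirical_score(x, t, data, m, sigma):
    """Evaluate grad_x log p_hat_t(x) at a single point x (shape (d,))."""
    centers = m(t) * data                                  # (n, d): m(t) * x_i
    diffs = centers - x                                    # (n, d): m(t) * x_i - x
    log_w = -np.sum(diffs ** 2, axis=1) / (2.0 * sigma(t) ** 2)
    w = np.exp(log_w - log_w.max())                        # stabilized mixture weights
    w /= w.sum()
    # Posterior-mean form: score(x) = E[m(t) X_0 - x | X_t = x] / sigma(t)^2
    return (w @ diffs) / sigma(t) ** 2

# Toy usage with placeholder schedules m(t) = exp(-t/2), sigma(t)^2 = 1 - exp(-t).
rng = np.random.default_rng(0)
samples = rng.normal(size=(100, 2))                        # stand-in training data
score = empirical_score(np.zeros(2), 0.5, samples,
                        m=lambda t: np.exp(-t / 2.0),
                        sigma=lambda t: np.sqrt(1.0 - np.exp(-t)))
```

The posterior-mean expression in the return statement is just the score of a Gaussian mixture evaluated in closed form; no network is trained, matching the "no additional training" point of the lemma.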

Figures (2)

  • Figure 1: Illustration of Consistency Models: At each time step $t$, the consistency model $f(\cdot, t)$ maps $\mathbf{x}_t$ to $\mathbf{x}_0$ along the trajectory of the probability flow ODE. We also show the score function applied at time $t$ in both distillation training and isolation training.
  • Figure 2: Illustration of $\widehat{X}_{\tau_k}^{\phi, M}$: Starting from the distribution of $X_{\tau_k}$ at time $\tau_k$ and following the discrete distillation-based backward process, the process ends at time $\tau_{k-1}$ with underlying law $\widehat{X}_{\tau_k}^{\phi, M}$ (a schematic one-step update is sketched after this list).
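
For concreteness, one canonical instance of such a discrete backward step is a single Euler update of the probability flow ODE with a plug-in score $s_\phi$. Assuming forward marginals of the form $m(t)X_0 + \sigma(t)Z$ as in Lemma 3.1 (this explicit form is our illustration; the paper's solver may differ), the update from $\tau_k$ to $\tau_{k-1}$ reads

$$\widehat{\mathbf{x}}_{\tau_{k-1}} = \mathbf{x}_{\tau_k} + (\tau_{k-1} - \tau_k)\left[\frac{\dot m(\tau_k)}{m(\tau_k)}\,\mathbf{x}_{\tau_k} - \Big(\sigma(\tau_k)\,\dot\sigma(\tau_k) - \frac{\dot m(\tau_k)}{m(\tau_k)}\,\sigma(\tau_k)^2\Big)\, s_\phi(\mathbf{x}_{\tau_k}, \tau_k)\right],$$

where the bracketed term is the probability-flow drift induced by these marginals.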

Theorems & Definitions (41)

  • Lemma 3.1
  • proof
  • Remark 4.1
  • Remark 4.2
  • Theorem 4.1: Main Theorem 1: Distillation
  • Remark 4.3
  • Remark 4.4
  • Theorem 4.2: Main Theorem 2: Isolation
  • Remark 4.5
  • Remark 4.6
  • ...and 31 more