Unveil Conditional Diffusion Models with Classifier-free Guidance: A Sharp Statistical Theory

Hengyu Fu, Zhuoran Yang, Mengdi Wang, Minshuo Chen

TL;DR

This work delivers the first sharp statistical theory for conditional diffusion models trained with classifier-free guidance, linking Hölder-smooth ground-truth conditionals to tractable, data-efficient learning. It introduces a universal conditional score-approximation framework based on diffused local polynomials, achieving rates adaptive to the data’s smoothness and, under stronger density assumptions, substantially faster convergence. Building on this, the authors establish end-to-end distribution-estimation guarantees with minimax-optimal rates, and extend the theory to model-based RL transition kernels, reward-directed generation, and linear inverse problems. The results provide rigorous foundations for the practical success of conditional diffusion methods across domains, highlighting how data regularity and coverage fundamentally shape statistical performance.

Abstract

Conditional diffusion models serve as the foundation of modern image synthesis and find extensive application in fields like computational biology and reinforcement learning. In these applications, conditional diffusion models incorporate various conditional information, such as prompt input, to guide sample generation towards desired properties. Despite this empirical success, the theory of conditional diffusion models is largely missing. This paper bridges the gap by presenting a sharp statistical theory of distribution estimation using conditional diffusion models. Our analysis yields a sample complexity bound that adapts to the smoothness of the data distribution and matches the minimax lower bound. The key to our theoretical development lies in an approximation result for the conditional score function, which relies on a novel diffused Taylor approximation technique. Moreover, we demonstrate the utility of our statistical theory in elucidating the performance of conditional diffusion models across diverse applications, including model-based transition kernel estimation in reinforcement learning, solving inverse problems, and reward-conditioned sample generation.
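For orientation, the following is a sketch of the standard classifier-free guidance training objective that the title and TL;DR refer to; it is not quoted from the paper. The notation is an assumption: $\phi_t(\cdot \mid \mathbf{x}_0)$ denotes the Gaussian transition density $\mathcal{N}(\alpha_t \mathbf{x}_0, \sigma_t^2 I)$ of the forward process, $\tau$ is a random mask that either keeps the label $\mathbf{y}$ or replaces it with a null token $\varnothing$, and $t_0$, $T$ are the early-stopping and terminal times.

```latex
% Sketch under assumed notation: denoising score matching with random label dropping.
\[
\widehat{\mathbf{s}} \in \operatorname*{arg\,min}_{\mathbf{s} \in \mathcal{F}}
\int_{t_0}^{T}
\mathbb{E}_{(\mathbf{x}_0, \mathbf{y})}\,
\mathbb{E}_{\tau}\,
\mathbb{E}_{\mathbf{x}_t \sim \phi_t(\cdot \mid \mathbf{x}_0)}
\Big\| \mathbf{s}(\mathbf{x}_t, \tau \mathbf{y}, t)
- \nabla_{\mathbf{x}_t} \log \phi_t(\mathbf{x}_t \mid \mathbf{x}_0) \Big\|_2^2 \, dt,
\qquad
\nabla_{\mathbf{x}_t} \log \phi_t(\mathbf{x}_t \mid \mathbf{x}_0)
= -\frac{\mathbf{x}_t - \alpha_t \mathbf{x}_0}{\sigma_t^2}.
\]
```

With $\tau \mathbf{y} = \mathbf{y}$ the network learns the conditional score, and with $\tau \mathbf{y} = \varnothing$ it learns the unconditional one, so a single network serves both roles.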

Paper Structure

This paper contains 103 sections, 55 theorems, 416 equations, and 5 figures.

Key Result

Theorem 3.2

Suppose Assumption \ref{assump:sub} holds. For sufficiently large $N$ and constants $C_{\sigma}, C_{\alpha}>0$, by taking the early-stopping time $t_0=N^{-C_\sigma}$ and the terminal time $T=C_{\alpha}\log N$, there exists ${\mathbf s} \in \mathcal{F}(M_t, W, \kappa, L, K)$ such that for any $\mathbf{y}$ [...] The hyperparameters in the ReLU neural network class $\mathcal{F}$ satisfy [...] where $\mathcal{O}$ hid[...]
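To make the time-horizon choice in the theorem concrete, here is a minimal Python sketch that evaluates $t_0 = N^{-C_\sigma}$ and $T = C_\alpha \log N$ for a few values of $N$. The function name `diffusion_time_horizon` and the default constants `c_sigma = 1.0` and `c_alpha = 1.0` are hypothetical, chosen only for illustration; the theorem as quoted does not fix their values.

```python
import math


def diffusion_time_horizon(n, c_sigma=1.0, c_alpha=1.0):
    """Return (t0, T) with t0 = n**(-c_sigma) and T = c_alpha * log(n).

    c_sigma and c_alpha are placeholder constants (assumptions); the
    theorem only requires them to be positive.
    """
    if n < 2:
        raise ValueError("n must be at least 2 so that log(n) > 0")
    if c_sigma <= 0 or c_alpha <= 0:
        raise ValueError("c_sigma and c_alpha must be positive")
    t0 = n ** (-c_sigma)         # early-stopping time, shrinks polynomially in n
    big_t = c_alpha * math.log(n)  # terminal time, grows logarithmically in n
    return t0, big_t


if __name__ == "__main__":
    for n in (10**3, 10**4, 10**5):
        t0, big_t = diffusion_time_horizon(n)
        print(f"N={n:>6d}  t0={t0:.2e}  T={big_t:.3f}")
```

The sketch shows the asymmetry in the two choices: the early-stopping time shrinks polynomially in $N$, while the terminal time only needs to grow logarithmically.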

Figures (5)

  • Figure 1: Comparison of approximation schemes in Theorems \ref{thm::score approx} and \ref{thm::score approx exp}. In the left panel, we use diffused local polynomials to approximate the numerator and denominator on a truncated cube. However, the existence of small-density regions necessitates a truncation at $\epsilon_{\rm low}$, which compromises the approximation efficiency. In contrast, under Assumption \ref{assump::expdensity}, we eliminate small-density regions within the cube, which leads to a fast approximation.
  • Figure 2: Network architecture of $\mathbf{f}_3^{\rm ReLU}$. We implement all the components of $\mathbf{f}_3$ ($f_1$, $\mathbf{f}_2$ and $\sigma_t$) through ReLU networks and combine them using the ReLU-approximated operators (product, inverse (reciprocal) and entrywise-min) to express $\mathbf{f}_3$ according to its definition in \ref{equ:: definition fb3}.
  • Figure 3: Network architecture of $f^{\text{ReLU}}_{v,k,j}$. We implement all the basic functions (e.g., $x$, $\alpha_t$ and $\sigma_t$) through ReLU networks and combine them using the ReLU-expressed operators (product, inverse, clip and poly) to express $f_{v,k,j}$ according to its definition in \ref{equ::single component}.
  • Figure 4: Network architecture of $\mathbf{f}^{\text{ReLU}}_3$. We implement all the components of $\mathbf{f}_3$ ($f_1$, $\mathbf{f}_2$, $\widehat{\sigma}_t$ and $\widehat{\alpha}_t$) through ReLU networks and combine them using the ReLU-expressed operators (product, inverse and entrywise-min/max) to express $\mathbf{f}_3$ according to its definition in \ref{equ::fb3 new definition}.
  • Figure 5: Network architecture of $f^{\text{ReLU}}_{v,k,j}$. We implement all the basic functions (e.g., $x$, $\widehat{\alpha}_t$ and $\widehat{\sigma}_t$) through ReLU networks and combine them using the ReLU-expressed operators (product, inverse, clip and poly) to express $f_{v,k,j}$ according to its definition in \ref{equ::single component new}.

Theorems & Definitions (95)

  • Definition 2.1: Hölder norm
  • Theorem 3.2
  • Theorem 3.4
  • Theorem 4.1
  • Theorem 4.2
  • Proposition 4.3
  • Proposition 4.5
  • Proposition 5.2
  • Proposition 5.4
  • Lemma A.1: Truncate $\mathbf{x}$
  • ...and 85 more