Optimal estimation of a factorizable density using diffusion models with ReLU neural networks

Jianqing Fan, Yihong Gu, Ximing Li

TL;DR

The paper addresses the problem of estimating densities with a low-dimensional factorizable structure using score-based diffusion models and standard fully connected ReLU networks. It shows that, even though the diffusion process may erase the low-dimensional factorization at many timesteps, the diffused score can be decomposed into compositions of low-dimensional or super-smooth components, enabling neural networks to achieve minimax-optimal rates in TV distance of $n^{-\beta/(2\beta+d^*)}$ and, with a piecewise estimator, improved $W_1$ rates of $n^{-(\beta+d^*/d)/(2\beta+d^*)}$ (up to logs). The main technical contribution is a dimension-free approximation bound for the diffused score across time and a versatile neural-network construction that attains these rates without relying on specialized architectures. Practically, the results imply that vanilla diffusion models with ReLU score estimators are adaptively efficient for a broad class of structured densities, and the approach yields implementable sampling procedures with provable guarantees. The work thus advances the theoretical understanding of diffusion models in structured-density estimation and informs the design of scalable, structure-aware generative modeling methods.
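To get a feel for what these exponents mean, the following minimal sketch evaluates the TV and $W_1$ rate exponents quoted above and compares them with the classical unstructured benchmark $n^{-\beta/(2\beta+d)}$. That benchmark is the standard nonparametric rate and is not quoted in this summary. The function names and the example values of $(\beta, d, d^*)$ are illustrative assumptions, and constants and logarithmic factors are ignored.

```python
import math


def tv_rate_exponent(beta: float, dim: float) -> float:
    """Exponent a such that the TV rate is n^{-a}, with a = beta / (2*beta + dim)."""
    return beta / (2.0 * beta + dim)


def w1_rate_exponent(beta: float, d: float, d_star: float) -> float:
    """Exponent of the W_1 rate n^{-(beta + d_star/d) / (2*beta + d_star)}."""
    return (beta + d_star / d) / (2.0 * beta + d_star)


def samples_needed(target_error: float, exponent: float) -> int:
    """Smallest n with n^{-exponent} <= target_error (constants and logs ignored)."""
    return math.ceil(target_error ** (-1.0 / exponent))


if __name__ == "__main__":
    # Hypothetical setting: smoothness 2, ambient dimension 16, interaction order 2.
    beta, d, d_star = 2.0, 16, 2
    structured = tv_rate_exponent(beta, d_star)   # uses d_star in the denominator
    unstructured = tv_rate_exponent(beta, d)      # classical rate with full dimension d
    w1 = w1_rate_exponent(beta, d, d_star)
    print(f"TV exponent, factorizable structure: {structured:.4f}")
    print(f"TV exponent, no structure:           {unstructured:.4f}")
    print(f"W1 exponent, factorizable structure: {w1:.4f}")
    print("samples for error 0.1 (structured):  ", samples_needed(0.1, structured))
    print("samples for error 0.1 (unstructured):", samples_needed(0.1, unstructured))
```

Because the exponent depends on $d^*$ rather than $d$, the sample size needed for a fixed accuracy grows far more slowly with the ambient dimension when the factorizable structure is present.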

Abstract

This paper investigates score-based diffusion models for density estimation when the target density admits a factorizable low-dimensional nonparametric structure. To be specific, we show that when the log density admits a $d^*$-way interaction model with $\beta$-smooth components, the vanilla diffusion model, which uses a fully connected ReLU neural network for score matching, can attain the optimal $n^{-\beta/(2\beta+d^*)}$ statistical rate of convergence in total variation distance. This is, to the best of our knowledge, the first result in the literature showing that diffusion models with standard configurations can adapt to low-dimensional factorizable structures. The main challenge is that the low-dimensional factorizable structure no longer holds for most of the diffused timesteps, and it is very challenging to show that these diffused score functions can be well approximated without a significant increase in the number of network parameters. Our key insight is to demonstrate that the diffused score functions can be decomposed into a composition of either super-smooth or low-dimensional components, leading to a new approximation error analysis of ReLU neural networks with respect to the diffused score function. The rate of convergence under the 1-Wasserstein distance is also derived with a slight modification of the method.
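As a rough illustration of what "a fully connected ReLU neural network for score matching" can look like in code, here is a minimal NumPy sketch of a denoising score matching loss. It is not the paper's estimator. The Ornstein-Uhlenbeck forward process $X_t = e^{-t}X_0 + \sqrt{1-e^{-2t}}\,Z$, the uniform sampling of $t$, the early-stopping time `t_min`, and the helper names `init_mlp`, `relu_mlp`, and `dsm_loss` are all assumptions made for the example.

```python
import numpy as np


def init_mlp(rng, sizes):
    """He-initialised weights and zero biases for a fully connected network."""
    return [
        (rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]


def relu_mlp(params, x, t):
    """Score network s(x, t): the input is x concatenated with t, hidden layers use ReLU."""
    h = np.concatenate([x, t[:, None]], axis=1)
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)
    W, b = params[-1]
    return h @ W + b  # linear output layer, one score coordinate per input coordinate


def dsm_loss(params, x0, rng, t_min=1e-3, t_max=1.0):
    """Monte Carlo denoising score matching loss under the assumed OU forward process."""
    n, d = x0.shape
    t = rng.uniform(t_min, t_max, size=n)        # t_min keeps the target bounded
    mean_scale = np.exp(-t)[:, None]
    std = np.sqrt(1.0 - np.exp(-2.0 * t))[:, None]
    z = rng.standard_normal((n, d))
    xt = mean_scale * x0 + std * z               # sample from X_t given X_0
    target = -z / std                            # conditional score of X_t given X_0
    pred = relu_mlp(params, xt, t)
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 4
    x0 = rng.standard_normal((256, d))           # placeholder data, not from the paper
    params = init_mlp(rng, [d + 1, 32, 32, d])   # width 32, two hidden layers
    print("DSM loss at initialisation:", dsm_loss(params, x0, rng))
```

Minimising this loss over the network parameters with any gradient-based optimiser yields a score estimate; the width and depth chosen here are arbitrary and do not reflect the $W$ and $L$ scaling analysed in the paper.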

Paper Structure

This paper contains 16 sections, 3 theorems, 38 equations, 2 figures, 2 algorithms.

Key Result

Theorem 3.1

Assume Conditions condition:lowerbound and condition:betat hold. There exist constants $c_3,c_4,c_5,c_6,c_7$ depending only on $(c_1,c_2,\beta,d,d^*,C)$ such that the following holds. For any $W,L$ satisfying $\min\{W, L\}\geq \left(1+\log (WL) \right)^{c_5}$, letting $\underline{T}=(WL)^{-c_3}$, ... The dependency of $c_3,\ldots,c_6$ on $(c_1,c_2,\beta,d,d^*,C)$ can be found in Condition D.1.

Figures (2)

  • Figure 1: Illustrative examples with 4 variables ($d=4$) and $d^*=2$. For the Markov random field, an edge between nodes $x$ and $y$ indicates that variables $x$ and $y$ may not be independent given all the other variables. For the Bayesian network, an arrow from node $x$ to $y$ indicates that $x$ is a direct cause of $y$.
  • Figure 2: Illustration of the construction of the neural network approximator for $p_t$. The approximator for $p_t$ is constructed in a bottom-to-top manner. A blue arrow from "low rank" to node $x$ indicates that $x$ is a product of $d^*$-dimensional Hölder-smooth functions, whose neural network approximators are constructed with Theorem 1.1 in lu2021deep. For example, the nodes $G_{\mathcal{S}\backslash\mathcal{A}}, G_{\mathcal{A}\backslash\mathcal{B}}$ are products of $d^*$-dimensional $(\beta,C)$-functions, and the node $G^{(1)}_{\mathcal{A}\backslash\mathcal{B}}$ is a product of $d^*$-dimensional $(\beta-1,C)$-functions. A red arrow from "smoothness" to node $x$ indicates that $x$ is super-smooth or its parent node is negligible; its neural network approximator is constructed with Theorem A.18. For example, either $p_{t,\mathcal{B}}$ is super-smooth or $\Delta_{\mathcal{A}}$ is negligible. Black arrows from nodes $x,y$ to $z$ indicate that $z$ is a sum-of-products of $x$ and $y$. Thus the neural network approximator for $z$ can be constructed using the approximators for $x$ and $y$.

Theorems & Definitions (12)

  • Remark 2.1
  • Definition 2.1: $(\beta,C)$-smooth
  • Definition 2.2: Exponential-interaction model
  • Remark 2.2
  • Theorem 3.1: Approximation error
  • Remark 3.1
  • Remark 3.2
  • Remark 3.3
  • Theorem 3.2: Error bound in TV distance
  • Remark 3.4
  • ...and 2 more