Optimal estimation of a factorizable density using diffusion models with ReLU neural networks
Jianqing Fan, Yihong Gu, Ximing Li
TL;DR
The paper addresses the problem of estimating densities with a low-dimensional factorizable structure using score-based diffusion models and standard fully connected ReLU networks. It shows that, even though the diffusion process erases the low-dimensional factorization at most timesteps, the diffused score can be decomposed into compositions of low-dimensional or super-smooth components. This enables neural networks to achieve the minimax-optimal rate $n^{-\beta/(2\beta+d^*)}$ in total variation (TV) distance and, with a piecewise estimator, an improved $W_1$ rate of $n^{-(\beta+d^*/d)/(2\beta+d^*)}$ (up to logarithmic factors). The main technical contribution is a dimension-free approximation bound for the diffused score across time, together with a neural-network construction that attains these rates without relying on specialized architectures. Practically, the results imply that vanilla diffusion models with ReLU score estimators adapt efficiently to a broad class of structured densities, and the approach yields implementable sampling procedures with provable guarantees. The work thus advances the theoretical understanding of diffusion models in structured-density estimation and informs the design of scalable, structure-aware generative modeling methods.
Abstract
This paper investigates score-based diffusion models for density estimation when the target density admits a factorizable low-dimensional nonparametric structure. Specifically, we show that when the log density admits a $d^*$-way interaction model with $\beta$-smooth components, the vanilla diffusion model, which uses a fully connected ReLU neural network for score matching, attains the optimal $n^{-\beta/(2\beta+d^*)}$ statistical rate of convergence in total variation distance. To the best of our knowledge, this is the first result in the literature showing that diffusion models with standard configurations can adapt to low-dimensional factorizable structures. The main challenge is that the low-dimensional factorizable structure no longer holds at most of the diffused timesteps, so it is difficult to show that the diffused score functions can be well approximated without a significant increase in the number of network parameters. Our key insight is that the diffused score functions can be decomposed into a composition of either super-smooth or low-dimensional components, which leads to a new approximation error analysis of ReLU neural networks for the diffused score function. The rate of convergence under the 1-Wasserstein distance is also derived with a slight modification of the method.
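To give a rough sense of what adapting to the interaction order $d^*$ buys, the following minimal sketch evaluates the stated rate $n^{-\beta/(2\beta+d^*)}$ next to the classical rate $n^{-\beta/(2\beta+d)}$ for a generic $\beta$-smooth density in ambient dimension $d$. The values of `n`, `beta`, `d_star`, and `d` are illustrative assumptions, not taken from the paper, and constants and logarithmic factors are ignored.

```python
def tv_rate(n: int, beta: float, dim: int) -> float:
    """Return n^{-beta / (2*beta + dim)}, ignoring constants and log factors."""
    return n ** (-beta / (2 * beta + dim))


# Illustrative (hypothetical) values, not from the paper.
n = 10**6      # sample size
beta = 2.0     # smoothness of each component
d_star = 2     # interaction order of the log density
d = 20         # ambient dimension

# Rate when the estimator adapts to the d*-way interaction structure.
structured = tv_rate(n, beta, d_star)

# Classical rate for a generic beta-smooth density in d dimensions.
unstructured = tv_rate(n, beta, d)

print(f"structured rate   (d* = {d_star}): {structured:.4f}")
print(f"unstructured rate (d  = {d}): {unstructured:.4f}")
```

With these assumed values the exponents are $-1/3$ and $-1/12$, so the two rates are roughly $0.01$ and $0.32$: the gap comes entirely from replacing $d$ with $d^*$ in the exponent.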
