Nonparametric estimation of a factorizable density using diffusion models

Hyeok Kyu Kwon; Dongha Kim; Ilsang Ohn; Minwoo Chae


TL;DR

This paper views diffusion models as an implicit approach to nonparametric density estimation and studies them within a statistical framework to analyze their surprising performance. It demonstrates that an implicit density estimator constructed from diffusion models achieves the minimax optimal rate with respect to the total variation distance.

Abstract

In recent years, diffusion models, and more generally score-based deep generative models, have achieved remarkable success in various applications, including image and audio generation. In this paper, we view diffusion models as an implicit approach to nonparametric density estimation and study them within a statistical framework to analyze their surprising performance. A key challenge in high-dimensional statistical inference is leveraging low-dimensional structures inherent in the data to mitigate the curse of dimensionality. We assume that the underlying density exhibits a low-dimensional structure by factorizing into low-dimensional components, a property common in examples such as Bayesian networks and Markov random fields. Under suitable assumptions, we demonstrate that an implicit density estimator constructed from diffusion models adapts to the factorization structure and achieves the minimax optimal rate with respect to the total variation distance. In constructing the estimator, we design a sparse weight-sharing neural network architecture, where sparsity and weight-sharing are key features of practical architectures such as convolutional neural networks and recurrent neural networks.
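To make the idea of a density that factorizes into low-dimensional components concrete, here is a minimal sketch that is not taken from the paper. It assumes a stationary Gaussian first-order Markov chain (a simple Bayesian network) on $D = 7$ coordinates. The function names, the correlation parameter `rho`, and the Gaussian choice are all illustrative assumptions. Each factor depends on at most two coordinates, so the largest factor has dimension 2 even though the ambient dimension is 7.

```python
import math

def normal_pdf(x, mean, std):
    """Univariate Gaussian density."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def chain_factors(D, rho=0.5):
    """Hypothetical example: factors of a stationary Gaussian Markov chain.

    Returns (index_tuple, factor) pairs; each factor only sees the
    coordinates listed in its index tuple.
    """
    cond_std = math.sqrt(1.0 - rho * rho)
    # First coordinate: standard normal marginal.
    factors = [((0,), lambda v: normal_pdf(v[0], 0.0, 1.0))]
    # Remaining coordinates: conditional density of x_i given x_{i-1}.
    for i in range(1, D):
        factors.append(
            ((i - 1, i),
             lambda v, r=rho, s=cond_std: normal_pdf(v[1], r * v[0], s))
        )
    return factors

def density(x, factors):
    """Evaluate the joint density as a product of low-dimensional factors."""
    p = 1.0
    for idx, g in factors:
        p *= g([x[i] for i in idx])
    return p

D = 7
factors = chain_factors(D)
# Largest number of coordinates any single factor depends on.
largest_factor_dim = max(len(idx) for idx, _ in factors)
x = [0.1 * i for i in range(D)]
p = density(x, factors)
```

In this toy example the joint density on seven coordinates is fully described by functions of at most two variables, which is the kind of structure an estimator can exploit to avoid rates that depend on the full dimension $D$.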

Paper Structure (44 sections, 26 theorems, 558 equations, 5 figures)


Introduction
Notations and Definitions
Diffusion Models
Sparse Weight-Sharing Neural Networks
Factorizable Densities
Factorization Assumption
Example: Bayesian Networks
Example: Markov Random Fields
Main Results
Assumptions
Convergence Rate
Approximation Theory
Sub-Optimality of a Vanilla Score Matching Estimator
Numerical Experiments
Data Set Descriptions
...and 29 more sections

Key Result

Theorem 5.1

Suppose that $p_0$ satisfies (S), (L), and (B). Let $\tau_{\min}$ and $\tau_{\max}$ be constants with ... Let $\underline{T} = n^{-\tau_{\min}}$ and $\overline{T} = \tau_{\max} \log n$. Then, for every $n \geq C_{2}$, there exist a collection of permutation matrices $\mathcal{P}_{{\bf m}} = ( ( \mathcal{Q}_{i}, \mathcal{R}_{i} ) )_{i \in [L-1]}$ and a class of weight-sharing neural network

Figures (5)

Figure 1: Example of a 2-dimensional convolution operation with an input ${\bf x} \in {\mathbb R}^{16}$, a filter vector ${\bf w} \in {\mathbb R}^{4}$, and an output ${\bf y} \in {\mathbb R}^9$. The operation can be represented as a matrix multiplication (Goodfellow et al., 2016), given by ${\bf y} = \widetilde{W} {\bf x}$.
Figure 2: Examples of directed and undirected graphical model structures for a 7-dimensional random vector. In both cases, the effective dimension $d$ is strictly less than $D=7$.
Figure 3: An image of the digit '0' from the MNIST data set (LeCun et al., 1998), along with two possible undirected graph structures for MNIST. For each pixel, a larger neighborhood may be considered depending on the degree of spatial correlation.
Figure 4: An illustration of why weight-sharing helps reduce model complexity: At some middle layers of the network, we need to approximate a map with inputs ${\bf x}$ and ${\bf y}_{{\bf j}}, {\bf j} \in [m]^D$, and outputs $g_{{\bf j}} = g({\bf x}, {\bf y}_{{\bf j}}), {\bf j} \in [m]^D$. Since a single function $g$ is approximated for $m^D$ instances, leaving all network parameters as free parameters is inefficient. Weight-sharing can significantly reduce the number of distinct parameters. In this illustration, all edges with the same color share the same weight parameters.
Figure 5: Wasserstein-1 distance values for three cases (from left to right: $d = 1$, $2$, and $5$).
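
The matrix form ${\bf y} = \widetilde{W} {\bf x}$ in the Figure 1 caption, and the parameter-sharing point in Figure 4, can be illustrated with a small sketch. It assumes the input in ${\mathbb R}^{16}$ is a flattened $4 \times 4$ grid, the filter in ${\mathbb R}^4$ is a flattened $2 \times 2$ kernel, and the convolution uses stride 1 with no padding and no kernel flip (the cross-correlation convention common in deep learning). These layout choices and all function names are assumptions made for illustration, not details confirmed by the source.

```python
def conv_matrix(w, in_side=4, k=2):
    """Build the sparse, weight-sharing matrix of a 2D convolution.

    Assumes row-major flattening, stride 1, no padding, no kernel flip.
    """
    out_side = in_side - k + 1
    W = [[0.0] * (in_side * in_side) for _ in range(out_side * out_side)]
    for r in range(out_side):
        for c in range(out_side):
            row = r * out_side + c
            for a in range(k):
                for b in range(k):
                    col = (r + a) * in_side + (c + b)
                    # The same filter entry is reused in every row.
                    W[row][col] = w[a * k + b]
    return W

def matvec(W, x):
    """Plain matrix-vector product."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def conv2d_direct(x, w, in_side=4, k=2):
    """Reference sliding-window computation for comparison."""
    out_side = in_side - k + 1
    return [
        sum(w[a * k + b] * x[(r + a) * in_side + (c + b)]
            for a in range(k) for b in range(k))
        for r in range(out_side) for c in range(out_side)
    ]

w = [1.0, 2.0, 3.0, 4.0]           # filter vector in R^4
x = [float(i) for i in range(16)]  # input vector in R^16
W = conv_matrix(w)                 # 9 x 16 matrix
y = matvec(W, x)                   # output vector in R^9

total_entries = len(W) * len(W[0])
nonzeros = sum(1 for row in W for v in row if v != 0.0)
distinct_weights = len({v for row in W for v in row if v != 0.0})
```

Under these assumptions the matrix has 144 entries, of which only 36 are nonzero (sparsity), and those 36 entries take only 4 distinct values (weight-sharing). A dense layer of the same shape would instead have 144 free parameters.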

Theorems & Definitions (53)

Remark 2.1
Remark 2.2
Theorem 5.1
Remark 5.2
Theorem 5.3
Lemma A.1: Upper and lower bounds for $p_t({\bf x})$
proof
Lemma A.2: Boundedness of score function
proof
Lemma A.3: Boundedness of derivatives
...and 43 more
