A Survey on Statistical Theory of Deep Learning: Approximation, Training Dynamics, and Generative Models

Namjoon Suh; Guang Cheng

TL;DR

This survey organizes theoretical progress on deep learning into three pillars: approximation theory, training dynamics, and generative-model theory. It highlights how deep nets achieve expressive power and favorable nonparametric rates through depth, while also detailing NTK and Mean-Field regimes that explain how gradient-based training behaves in wide and infinite-width limits. It reviews statistical guarantees for GANs, diffusion models, and in-context learning, illustrating how these models approximate distributions and adapt to data structures under various assumptions. The work emphasizes bridging kernel-like behavior and feature learning, and it underscores open questions about finite-width behavior, optimal architectures, and the theory of modern generative AI systems. Overall, it provides a foundation for principled design and analysis of deep learning methods in practical, high-dimensional settings, with explicit mathematical characterizations throughout.

Abstract

In this article, we review the literature on statistical theories of neural networks from three perspectives: approximation, training dynamics, and generative models. In the first part, results on excess risks for neural networks are reviewed in the nonparametric framework of regression (and classification in Appendix B). These results rely on explicit constructions of neural networks, leading to fast convergence rates of excess risks. Nonetheless, their underlying analysis only applies to the global minimizer in the highly non-convex landscape of deep neural networks. This motivates us to review the training dynamics of neural networks in the second part. Specifically, we review papers that attempt to answer "how the neural network trained via gradient-based methods finds the solution that can generalize well on unseen data." In particular, two well-known paradigms are reviewed: the Neural Tangent Kernel (NTK) paradigm and the Mean-Field (MF) paradigm. Last but not least, we review the most recent theoretical advancements in generative models, including Generative Adversarial Networks (GANs), diffusion models, and in-context learning (ICL) in Large Language Models (LLMs), from the two perspectives reviewed previously, i.e., approximation and training dynamics.

Paper Structure

This paper contains 21 sections, 1 theorem, 45 equations, 2 figures, and 3 tables.

Introduction
Why theory is important?
Roadmap of the paper
Existing surveys on deep learning theory
Approximation theory-based statistical guarantees
Expressive power of fully-connected networks
Statistical guarantees for regression tasks
Training dynamics-based statistical guarantees
Neural Tangent Kernel Perspective
Mean-Field Perspective
Beyond the NTK and Mean-Field regimes
Statistical Guarantees of Generative Models
Generative Adversarial Networks (GAN)
Score-based diffusion models
In-Context Learning in Large Language Model
...and 6 more sections

Key Result

Proposition 3.1

(Theorem 4.3 in cao2019towards; Proposition 5 in bietti2019inductive) For the neural tangent kernel corresponding to a two-layer feed-forward ReLU network, the eigenvalues $(\mu_{k})_{k}$ satisfy the following:
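
For intuition about the kernel in Proposition 3.1, its spectrum can be inspected numerically. The following is a minimal sketch, not the proposition's statement. It assumes a bias-free two-layer ReLU network with both layers trained and inputs on the unit sphere, for which the NTK has the arc-cosine closed form $\kappa(u) = u\,\kappa_0(u) + \kappa_1(u)$ with $u = \langle x, x' \rangle$. The normalization (so that $\kappa(1) = 2$), the helper names `sample_sphere`, `relu_ntk_two_layer`, and `empirical_spectrum`, and the choices of $n$ and $d$ are assumptions of this sketch. The eigenvalues of the empirical Gram matrix only approximate those of the underlying integral operator and appear with multiplicities, so they are not indexed like $(\mu_{k})_{k}$.

```python
import numpy as np


def sample_sphere(n, d, seed=0):
    """Draw n points uniformly from the unit sphere in R^d."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def relu_ntk_two_layer(x):
    """Gram matrix of the bias-free two-layer ReLU NTK for unit-norm rows of x.

    Uses kappa(u) = u * kappa0(u) + kappa1(u), where kappa0 and kappa1 are
    the arc-cosine kernels of degree 0 and 1. The normalization is an
    assumption of this sketch.
    """
    u = np.clip(x @ x.T, -1.0, 1.0)  # pairwise cosines, clipped for arccos
    theta = np.arccos(u)
    kappa0 = (np.pi - theta) / np.pi
    kappa1 = (u * (np.pi - theta) + np.sqrt(1.0 - u**2)) / np.pi
    return u * kappa0 + kappa1


def empirical_spectrum(n=200, d=3, seed=0):
    """Eigenvalues of K / n, sorted in descending order."""
    k = relu_ntk_two_layer(sample_sphere(n, d, seed))
    return np.linalg.eigvalsh(k / n)[::-1]


if __name__ == "__main__":
    eigs = empirical_spectrum()
    print(eigs[:10])  # leading eigenvalues of the empirical kernel matrix
```

Plotting `eigs` on a log-log scale for increasing `n` gives a rough picture of how quickly the spectrum decays, which is the quantity the proposition characterizes.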

Figures (2)

Figure 1: Compared to classical linear estimators (e.g., wavelet estimators and kernel ridge regressors), sparsely connected neural networks are more adaptive in estimating functions $f_{\rho}$ with special structures. The figure illustrates the different settings of function classes $\mathcal{G}$ in which neural networks adapt better than the classical estimators.
Figure 2: Development of the literature (y-axis) on algorithm-based neural network analysis over time (x-axis). We view the ultimate goal of this line of research (represented by a star) as theoretically demystifying feature learning in neural networks with deep layers and finite width, closing the gap with practice. Note that the kernel regressor in the NTK regime does not exhibit feature learning.

Theorems & Definitions (3)

Proposition 3.1
Definition 4.1
Definition 4.2: In-context learning garg2022can
