Quantitative CLTs in Deep Neural Networks
Stefano Favaro, Boris Hanin, Domenico Marinucci, Ivan Nourdin, Giovanni Peccati
TL;DR
This work analyzes fully connected neural networks with Gaussian weights and biases at initialization, where the hidden-layer widths are proportional to a large parameter $n$, and derives quantitative central limit theorems (CLTs) comparing finite-width networks to their infinite-width Gaussian process limit. The authors establish one-dimensional, finite-dimensional, and functional CLTs using Stein's method, conditional Gaussian representations, and novel coupling techniques, obtaining rates that scale as powers of $n$ (e.g., $n^{-1/2}$ in 1D, $n^{-1/2}$ or $n^{-1/8}$ in higher-dimensional and functional settings). Key contributions include new bounds for conditionally Gaussian random variables, convex-distance results for possibly degenerate covariances, and Sobolev-space formulations that yield functional CLTs with explicit width dependence. These results strengthen the theoretical understanding of finite-width effects in neural networks and provide width-dependent bounds that improve upon prior work, with implications for initialization stability and for feature-learning regimes beyond the NTK regime. The paper also develops a suite of methodological tools (Stein-based bounds, coupling arguments, and operator-perturbation inequalities) that are likely to influence future probabilistic analyses of random neural networks.
Abstract
We study the distribution of a fully connected neural network with random Gaussian weights and biases in which the hidden layer widths are proportional to a large constant $n$. Under mild assumptions on the non-linearity, we obtain quantitative bounds on normal approximations valid at large but finite $n$ and any fixed network depth. Our theorems show, both for the finite-dimensional distributions and for the entire process, that the distance between a random fully connected network (and its derivatives) and the corresponding infinite-width Gaussian process scales like $n^{-\gamma}$ for some $\gamma>0$, with the exponent depending on the metric used to measure discrepancy. Our bounds are strictly stronger in terms of their dependence on network width than any previously available in the literature; in the one-dimensional case, we also prove that they are optimal, i.e., we establish matching lower bounds.
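
The finite-width effect described in the abstract can be illustrated numerically. The following is a minimal sketch under assumptions that are not taken from the paper: a ReLU non-linearity, weight variance `c_w / fan_in` with `c_w = 2`, bias variance `c_b = 0`, and hypothetical helper names `sample_outputs` and `excess_kurtosis`. It relies on the fact that, for a single fixed input, the pre-activations of each layer are i.i.d. centered Gaussians conditionally on the previous layer, so the output can be sampled layer by layer without materializing weight matrices. Excess kurtosis is used only as a simple diagnostic of non-Gaussianity; it is not one of the metrics analyzed by the authors.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def sample_outputs(x, width, depth, num_samples, c_w=2.0, c_b=0.0, seed=0):
    """Sample the scalar output of a random ReLU network at a single input x.

    `depth` is the number of hidden layers, each with `width` neurons.
    Assumed (illustrative) initialization: weights ~ N(0, c_w / fan_in),
    biases ~ N(0, c_b).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)

    # First hidden layer: exactly Gaussian with this variance.
    var = c_b + c_w * np.dot(x, x) / x.size
    z = np.sqrt(var) * rng.standard_normal((num_samples, width))

    # Remaining hidden layers: Gaussian conditionally on the previous layer,
    # with a random variance given by the empirical second moment of relu(z).
    for _ in range(depth - 1):
        var = c_b + c_w * np.mean(relu(z) ** 2, axis=1, keepdims=True)
        z = np.sqrt(var) * rng.standard_normal((num_samples, width))

    # Scalar output layer, again conditionally Gaussian.
    var = c_b + c_w * np.mean(relu(z) ** 2, axis=1)
    return np.sqrt(var) * rng.standard_normal(num_samples)


def excess_kurtosis(samples):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    centered = samples - samples.mean()
    return np.mean(centered ** 4) / np.mean(centered ** 2) ** 2 - 3.0


if __name__ == "__main__":
    x = [1.0, -0.5, 2.0]
    for width in (4, 16, 64):
        out = sample_outputs(x, width=width, depth=2, num_samples=100_000)
        print(width, out.var(), excess_kurtosis(out))
```

With these assumed settings the output variance matches the infinite-width value $2\|x\|^2/n_0$ (where $n_0$ is the input dimension) at every width, while the excess kurtosis shrinks as the width grows, which is one concrete way to see the output distribution approaching its Gaussian limit.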
