Quantitative convergence of trained single layer neural networks to Gaussian processes
Eloy Mosig, Andrea Agazzi, Dario Trevisan
TL;DR
The paper establishes a quantitative, finite-width bound on how fast shallow neural networks trained by gradient descent converge to their Gaussian-process counterparts in the infinite-width limit. By analyzing continuous-time gradient flow, the dynamics of the empirical neural tangent kernel (NTK), and the associated linearized regime, it proves that for any fixed input $x$ and any training time $t$ up to a polynomial in the width $n_1$, the squared quadratic Wasserstein distance between the network output $f(x;\theta_t)$ and its Gaussian-process counterpart $G_t(x)$ satisfies $\mathcal{W}_2^2(f(x;\theta_t),G_t(x)) = \mathcal{O}\big(\frac{\log n_1}{n_1}\big)$, with constants depending on the limiting kernel's eigenvalue and on the regularity of the activation. The results extend previous bounds at initialization to the full training trajectory, giving explicit, architecture-aware rates that quantify how width, input dimension, and training time govern the Gaussian-process approximation. Numerical experiments support the theory: Gaussian-process predictions closely track trained networks, and the Wasserstein distance decays with width as predicted. Overall, the work provides finite-width guarantees for NTK-based Gaussian-process approximations during training, indicating when kernel-based analyses reliably describe trained models.
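For reference, the quadratic Wasserstein distance in the bound above is the standard optimal-transport distance. The notation below ($\mu$, $\nu$, $\Gamma(\mu,\nu)$, $F^{-1}$) is introduced here for illustration and is not taken from the paper. For probability measures $\mu$, $\nu$ on $\mathbb{R}^k$ with finite second moments,

$$
\mathcal{W}_2^2(\mu,\nu) \;=\; \inf_{\gamma \in \Gamma(\mu,\nu)} \int \|u - v\|^2 \, d\gamma(u,v),
$$

where $\Gamma(\mu,\nu)$ is the set of couplings of $\mu$ and $\nu$. Assuming a scalar output ($k = 1$), the infimum is attained by the quantile coupling:

$$
\mathcal{W}_2^2(\mu,\nu) \;=\; \int_0^1 \big| F_\mu^{-1}(u) - F_\nu^{-1}(u) \big|^2 \, du .
$$

Taking square roots in the stated rate, the unsquared distance is $\mathcal{W}_2(f(x;\theta_t),G_t(x)) = \mathcal{O}\big(\sqrt{\log n_1 / n_1}\big)$.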
Abstract
In this paper, we study the quantitative convergence of shallow neural networks trained via gradient descent to their associated Gaussian processes in the infinite-width limit. While previous work has established qualitative convergence under broad settings, precise, finite-width estimates remain limited, particularly during training. We provide explicit upper bounds on the quadratic Wasserstein distance between the network output and its Gaussian approximation at any training time $t \ge 0$, demonstrating polynomial decay with network width. Our results quantify how architectural parameters, such as width and input dimension, influence convergence, and how training dynamics affect the approximation error.
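To make the quantity being bounded concrete, here is a minimal Monte Carlo sketch rather than the paper's experiment. It covers only the initialization time $t = 0$ and assumes a scalar-output network $f(x) = n_1^{-1/2} \sum_j a_j \tanh(w_j \cdot x / \sqrt{d})$ with i.i.d. standard Gaussian weights. The tanh activation, the scaling, and all function names are illustrative assumptions. The squared distance is estimated with the one-dimensional quantile coupling, so values below the sampling noise of the estimator (which shrinks as the number of samples grows) are not meaningful.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss
from statistics import NormalDist


def sample_outputs(x, width, n_samples, rng):
    """Draw n_samples independent networks at initialization and return f(x).

    Assumed (hypothetical) parametrization:
    f(x) = width**-0.5 * sum_j a_j * tanh(w_j . x / sqrt(d)), a_j, w_j ~ N(0, I).
    """
    d = x.shape[0]
    W = rng.standard_normal((n_samples, width, d))   # hidden-layer weights
    a = rng.standard_normal((n_samples, width))      # output-layer weights
    pre = W @ x / np.sqrt(d)                         # shape (n_samples, width)
    return (a * np.tanh(pre)).sum(axis=1) / np.sqrt(width)


def limiting_variance(x, degree=80):
    """Variance of the limiting Gaussian: E[tanh(s Z)^2], Z ~ N(0, 1), s = |x|/sqrt(d).

    Computed with Gauss-Hermite quadrature for the weight exp(-z^2 / 2).
    """
    s = np.linalg.norm(x) / np.sqrt(x.shape[0])
    nodes, weights = hermegauss(degree)
    return float((weights * np.tanh(s * nodes) ** 2).sum() / np.sqrt(2 * np.pi))


def w2_squared_to_gaussian(samples, variance):
    """Estimate W_2^2 between the sample law and N(0, variance) in one dimension.

    Uses the quantile coupling: sorted samples are matched with Gaussian
    quantiles evaluated at the midpoints (i + 0.5) / m.
    """
    m = samples.shape[0]
    probs = (np.arange(m) + 0.5) / m
    std_normal = NormalDist()
    quantiles = np.sqrt(variance) * np.array([std_normal.inv_cdf(p) for p in probs])
    return float(np.mean((np.sort(samples) - quantiles) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.ones(3)                       # arbitrary fixed input
    var = limiting_variance(x)
    for width in (1, 4, 16, 64):
        outputs = sample_outputs(x, width, n_samples=4000, rng=rng)
        print(width, w2_squared_to_gaussian(outputs, var))
```

Extending the sketch to $t > 0$ would require training each sampled network and comparing against the corresponding Gaussian process $G_t(x)$, which is not attempted here.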
