Quantitative convergence of trained single layer neural networks to Gaussian processes

Eloy Mosig, Andrea Agazzi, Dario Trevisan

TL;DR

The paper establishes a quantitative, finite-width bound on the convergence of shallow neural networks trained by gradient descent to their Gaussian-process counterparts in the infinite-width limit. By analyzing continuous-time gradient flow, empirical NTK dynamics, and the associated linearized regime, it proves that for any fixed input $x$ and training time $t$ up to a polynomial in the width, the squared quadratic Wasserstein distance satisfies $\mathcal{W}_2^2(f(x;\theta_t),G_t(x)) = \mathcal{O}\big(\frac{\log n_1}{n_1}\big)$, with constants depending on the limiting kernel eigenvalue and activation regularity. The results extend previous initialization bounds to the full training trajectory, offering explicit, architecture-aware rates that quantify how width, input dimension, and training time govern the Gaussian-process approximation. Numerical experiments validate the theory, showing Gaussian-process predictions closely track trained networks and that the Wasserstein distance decays with width as predicted. Overall, the work bridges theory and practice by providing actionable finite-width guarantees for NTK-based Gaussian-process approximations during training, informing when kernel-based analyses reliably describe trained models.

Abstract

In this paper, we study the quantitative convergence of shallow neural networks trained via gradient descent to their associated Gaussian processes in the infinite-width limit. While previous work has established qualitative convergence under broad settings, precise, finite-width estimates remain limited, particularly during training. We provide explicit upper bounds on the quadratic Wasserstein distance between the network output and its Gaussian approximation at any training time $t \ge 0$, demonstrating polynomial decay with network width. Our results quantify how architectural parameters, such as width and input dimension, influence convergence, and how training dynamics affect the approximation error.
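For reference, the quadratic Wasserstein distance used above has the following standard definition. The symbols $\mu$, $\nu$, $X$, $Y$ and $F^{-1}$ are notation introduced here for illustration and are not taken from the paper. The second identity holds only for laws on the real line, which applies here under the assumption that the network output at a fixed input is scalar.

```latex
% Quadratic (2-)Wasserstein distance between probability laws \mu and \nu
% with finite second moments: infimum over all couplings (X, Y).
\mathcal{W}_2(\mu,\nu)
  = \inf_{X \sim \mu,\; Y \sim \nu}
    \Big( \mathbb{E}\big[\, |X - Y|^2 \,\big] \Big)^{1/2}

% For laws on \mathbb{R}, the optimal coupling is the quantile coupling,
% with F_\mu^{-1} and F_\nu^{-1} the quantile functions of \mu and \nu:
\mathcal{W}_2^2(\mu,\nu)
  = \int_0^1 \big| F_\mu^{-1}(u) - F_\nu^{-1}(u) \big|^2 \, du
```

The one-dimensional formula is what makes the distance easy to estimate from samples: sort the samples and compare them with the quantiles of the reference law.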

Paper Structure

This paper contains 25 sections, 17 theorems, 137 equations, 1 figure.

Key Result

Theorem 3.4

Under Assumptions ass:gaussian_initialization, ass:positivity_kinf, ass:bounded, and ass:r, for each test point $x\in \mathbb{R}^{n_0}$ there exist positive constants $a_1$ and $a_2$, independent of $n_0$, $n_1$, and $t$, such that: Here $r$ is the constant appearing in Assumption ass:r.

Figures (1)

  • Figure 1: The Gaussian process approximates the neural network during training (left and center panels), and the two converge in 2-Wasserstein distance as the width increases (right panel). In the right panel, the blue points show the empirical Wasserstein distance between $f$ and $G$ for increasing widths, and the red curve is a power-law fit to the blue points.
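
The kind of measurement shown in the right panel can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' experiment: it looks only at Gaussian initialization (no training), uses a tanh activation and a unit-norm input, and all function names (`limit_variance`, `sample_outputs`, `empirical_w2_squared`) are hypothetical. It estimates the squared 2-Wasserstein distance between the output of a shallow network and its Gaussian limit for several widths, then fits a power law in log-log scale.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss
from statistics import NormalDist


def limit_variance(activation, degree=60):
    """E[activation(Z)^2] for Z ~ N(0, 1), by Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(degree)  # weight function exp(-x^2 / 2)
    return float(np.sum(weights * activation(nodes) ** 2) / np.sqrt(2.0 * np.pi))


def sample_outputs(width, num_samples, activation, rng):
    """Draw f(x) = width^{-1/2} * sum_i v_i * activation(w_i . x) at initialization.

    Assumption: x has unit norm and w_i ~ N(0, I), so each pre-activation
    w_i . x is N(0, 1) and can be sampled directly. v_i ~ N(0, 1).
    """
    pre = rng.standard_normal((num_samples, width))
    v = rng.standard_normal((num_samples, width))
    return np.sum(v * activation(pre), axis=1) / np.sqrt(width)


def empirical_w2_squared(samples, sigma):
    """Squared W2 between the empirical law of `samples` and N(0, sigma^2).

    Uses the one-dimensional quantile coupling: sorted samples are matched
    with Gaussian quantiles at the midpoints (i + 0.5) / m.
    """
    m = len(samples)
    gauss = NormalDist(mu=0.0, sigma=sigma)
    quantiles = np.array([gauss.inv_cdf((i + 0.5) / m) for i in range(m)])
    return float(np.mean((np.sort(samples) - quantiles) ** 2))


rng = np.random.default_rng(0)
widths = np.array([1, 2, 4, 8, 16])
sigma = np.sqrt(limit_variance(np.tanh))  # std of the Gaussian limit at x

w2sq = np.array([
    empirical_w2_squared(sample_outputs(n, 20000, np.tanh, rng), sigma)
    for n in widths
])

# Power-law fit W2^2 ~ C * width^slope, i.e. a straight line in log-log scale.
slope, intercept = np.polyfit(np.log(widths), np.log(w2sq), 1)

for n, d in zip(widths, w2sq):
    print(f"width={n:3d}  empirical W2^2={d:.3e}")
print(f"fitted power-law exponent: {slope:.2f}")
```

The fitted exponent from this sketch should not be read as a check of the paper's rate: it covers initialization only, and with a finite number of samples the estimate flattens out at a sampling-noise floor for larger widths.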

Theorems & Definitions (43)

  • Definition 2.1
  • Definition 2.2
  • Remark 3.1
  • Remark 3.2
  • Definition 3.3
  • Theorem 3.4
  • Remark 3.5
  • Proposition 3.6
  • Proposition 3.7
  • Lemma A.1
  • ...and 33 more