Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks

Andrea Montanari, Pierfrancesco Urbani

TL;DR

This work uses dynamical mean field theory to analyze gradient-flow training of large two-layer networks, uncovering a time-scale separation between fast feature learning and slow overfitting/feature unlearning. By replacing data residuals with Gaussian processes and solving DMFT equations in the large-width limit, the authors derive precise dynamical regimes for both pure-noise and latent-structure data under lazy and mean-field initializations. They identify interpolation thresholds, provide non-asymptotic bounds that separate learning from overfitting, and connect the results to both lazy (NTK-like) and nonlinear feature-learning behaviors. The findings illuminate why early stopping regularizes generalization, how initialization scale and network width shape dynamics, and offer a rigorous framework bridging statistical physics methods and deep learning generalization theory.

Abstract

Understanding the inductive bias and generalization properties of large overparametrized machine learning models requires characterizing the dynamics of the training algorithm. We study the learning dynamics of large two-layer neural networks via dynamical mean field theory, a well-established technique of non-equilibrium statistical physics. We show that, for large network width $m$ and a large number of samples per input dimension $n/d$, the training dynamics exhibit a separation of timescales, which implies: $(i)$ the emergence of a slow timescale associated with the growth in Gaussian/Rademacher complexity of the network; $(ii)$ an inductive bias towards small complexity if the initialization has small enough complexity; $(iii)$ a dynamical decoupling between feature learning and overfitting regimes; $(iv)$ a non-monotone behavior of the test error, associated with a `feature unlearning' regime at large times.
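
The setting described in the abstract can be explored numerically with a small simulation. The following is a minimal sketch, not the authors' code: it trains a two-layer network on pure-noise labels with full-batch gradient descent (an Euler discretization of gradient flow) and records the train error together with the mean absolute second-layer weight, used here as a stand-in for the $\ell_1$ complexity proxy. The $1/m$ output normalization, the `tanh` activation, the step-size rescaling by $m$, and all function names and default values are assumptions made for illustration only.

```python
import numpy as np

def init_network(m, d, rng, a0=1.0):
    """Assumed initialization: unit-norm first-layer rows, constant second layer."""
    W = rng.standard_normal((m, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    a = np.full(m, a0)
    return a, W

def forward(a, W, X):
    """f(x) = (1/m) * sum_i a_i * tanh(<w_i, x>)  (assumed normalization)."""
    return np.tanh(X @ W.T) @ a / len(a)

def gd_step(a, W, X, y, lr):
    """One gradient-descent step on the loss (1/2n) * sum_j (f(x_j) - y_j)^2."""
    m, n = len(a), len(y)
    act = np.tanh(X @ W.T)                 # shape (n, m)
    resid = act @ a / m - y                # shape (n,)
    grad_a = act.T @ resid / (n * m)
    grad_W = ((resid[:, None] * (1.0 - act**2)) * a[None, :]).T @ X / (n * m)
    # Step rescaled by m so that each neuron moves at an O(1) rate (assumption).
    return a - lr * m * grad_a, W - lr * m * grad_W

def train(m=20, d=30, alpha=0.3, tau=0.6, steps=200, lr=0.05, seed=0):
    """Fit pure-noise labels; return a list of (train_error, mean |a_i|) per step."""
    rng = np.random.default_rng(seed)
    n = int(alpha * m * d)                 # sample size n = alpha * m * d
    X = rng.standard_normal((n, d))
    y = tau * rng.standard_normal(n)       # labels independent of the inputs
    a, W = init_network(m, d, rng)
    history = []
    for _ in range(steps):
        resid = forward(a, W, X) - y
        history.append((0.5 * np.mean(resid**2), np.abs(a).mean()))
        a, W = gd_step(a, W, X, y, lr)
    return history

if __name__ == "__main__":
    hist = train()
    for step in (0, 50, 100, 199):
        err, complexity = hist[step]
        print(f"step {step:4d}  train error {err:.4f}  mean |a| {complexity:.4f}")
```

The sizes used here are far smaller than the $d=200$ simulations mentioned in the figure captions, so the output should only be read as a qualitative illustration of how train error and second-layer weight magnitude evolve together.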

Paper Structure

This paper contains 62 sections, 7 theorems, 285 equations, 33 figures.

Key Result

Theorem 1

Under the GF dynamics eq:GFlowFirst, and the data distribution in the introduction (with $k$ arbitrary), further assume $\|\sigma\|_{\hbox{\tiny\rm Lip}}, \|\sigma\|_{\infty}\le L$, $|\varphi(0)|,\|\varphi\|_{\hbox{\tiny\rm Lip}}\le L$, $\|{\boldsymbol{a}}(0)\|_\infty\le a_0$, for some $a_0\ge 1$ an

Figures (33)

  • Figure 1: Three dynamical regimes of learning in a two-layer neural network with $m$ hidden neurons. Training data comprises $n$ points in $d$ dimensions distributed according to a single index model. We assume $n,m,d$ all large with $n/md=\alpha$ (here $\alpha=0.3$). Blue: test error. Purple: train error. Red: $\ell_1$ norm of second-layer weights (a proxy for model complexity).
  • Figure 2: Evolution of second-layer weights (left) and train error (right) when fitting pure noise data. Here we use mean field initialization, $h(z) = (9/10)z + (1/6)z^3$, $\alpha=0.4$ and $\tau=0.6$. Symbols: SGD results on actual two-layer networks with $d=200$, $n=\alpha md$ (averaged over 10 simulations). Continuous viridis lines: numerical solution of the DMFT equations. Note that the second-layer weights are described by a single scalar quantity as a result of the statistically symmetric initialization.
  • Figure 3: Train/test error (right) when fitting data from a single index model. We set $h(z)=\widehat{\varphi}(z)=(9/10)z+z^2/2$, $\tau=0.3$ and $\alpha=0.3$. Lines correspond to predictions from the DMFT (continuous: train error; dashed: test error). Black continuous line is the $m\to \infty$ value. Right: Same data plotted versus $t$.
  • Figure 4: Training dynamics under a single-index model. We set $h(q)=\widehat{\varphi}(q)=(9/10)q+q^3/6$, $\tau=0.3$ and $\alpha=0.3$, under mean field initialization. Left: second-layer weights. Right: train and test error. Symbols are empirical results for SGD with actual two-layer neural networks with $d=200$, $n=\alpha m d$ (averaged over $10$ simulations). Lines correspond to predictions from the DMFT (on the right, continuous: train error; dashed: test error).
  • Figure 5: Left: second-layer weights on the scale $\sqrt m$ as a function of $t/m$. Curves appear to collapse onto a master curve. The red arrow denotes $\gamma_{GF}^*$, and the curves appear to converge to that limit. Center: the projection of the first-layer weights on the latent space in the single index model as a function of time, on timescales of order $m$. Right: difference between test and train error as a function of the second-layer weights on the scale $\sqrt m$. The finite-$m$ curves approach a scaling curve which coincides with the one obtained by evaluating the same quantity with a lazy initialization and fixed second-layer weights.
  • ...and 28 more figures

Theorems & Definitions (12)

  • Theorem 1
  • Theorem 2
  • Remark 3.1
  • Lemma J.1
  • proof
  • Lemma J.2
  • proof
  • Corollary J.3
  • Lemma J.4
  • proof
  • ...and 2 more