Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks
Andrea Montanari, Pierfrancesco Urbani
TL;DR
This work uses dynamical mean field theory (DMFT) to analyze gradient-flow training of large two-layer networks, uncovering a time-scale separation between fast feature learning and slow overfitting/feature unlearning. By replacing data residuals with Gaussian processes and solving the DMFT equations in the large-width limit, the authors derive precise dynamical regimes for both pure-noise and latent-structure data under lazy and mean-field initializations. They identify interpolation thresholds, provide non-asymptotic bounds that separate learning from overfitting, and connect the results to both lazy (NTK-like) and nonlinear feature-learning behaviors. The findings illuminate why early stopping acts as a regularizer and how initialization scale and network width shape the dynamics, and they offer a rigorous framework bridging statistical physics methods and deep learning generalization theory.
Abstract
Understanding the inductive bias and generalization properties of large overparametrized machine learning models requires characterizing the dynamics of the training algorithm. We study the learning dynamics of large two-layer neural networks via dynamical mean field theory, a well-established technique of non-equilibrium statistical physics. We show that, for large network width $m$ and large number of samples per input dimension $n/d$, the training dynamics exhibits a separation of time scales which implies: $(i)$~The emergence of a slow time scale associated with the growth in Gaussian/Rademacher complexity of the network; $(ii)$~Inductive bias towards small complexity if the initialization has small enough complexity; $(iii)$~A dynamical decoupling between feature learning and overfitting regimes; $(iv)$~A non-monotone behavior of the test error, associated with a `feature unlearning' regime at large times.
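
As a rough illustration of the setting, the sketch below trains a small two-layer network by full-batch gradient descent (a discretization of gradient flow) and records train and test error along the trajectory. Everything specific here is an assumption made for illustration and is not taken from the paper: the function name `train_two_layer`, the tanh activation, the $1/m$ output scaling with step size multiplied by $m$, the single-index teacher with additive label noise, and all numerical values. The sizes are far from the asymptotic regime analyzed in the paper, so the sketch only provides curves to inspect; it does not reproduce the results.

```python
import numpy as np

def train_two_layer(n=200, d=20, m=64, steps=300, lr=0.2, noise=0.5, seed=0):
    """Gradient descent on squared loss for f(x) = (1/m) sum_i a_i tanh(<w_i, x>).

    Returns the train and test error recorded before each step.
    All modeling choices are illustrative assumptions, not the paper's setup.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical latent-structure data: labels depend on one direction u, plus noise.
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)

    def sample(k):
        X = rng.standard_normal((k, d))
        y = np.tanh(X @ u) + noise * rng.standard_normal(k)
        return X, y

    X, y = sample(n)
    X_test, y_test = sample(2000)

    # Random initialization of first-layer weights W (m x d) and second-layer weights a (m).
    W = rng.standard_normal((m, d)) / np.sqrt(d)
    a = rng.standard_normal(m)

    train_err, test_err = [], []
    for _ in range(steps):
        H = np.tanh(X @ W.T)                 # hidden activations, shape (n, m)
        r = H @ a / m - y                    # training residuals, shape (n,)
        train_err.append(0.5 * np.mean(r ** 2))
        r_test = np.tanh(X_test @ W.T) @ a / m - y_test
        test_err.append(0.5 * np.mean(r_test ** 2))

        # Gradients of the empirical risk 0.5 * mean(r^2).
        grad_a = H.T @ r / (n * m)
        grad_W = (r[:, None] * (1.0 - H ** 2) * a[None, :]).T @ X / (n * m)

        # Step size scaled by m so that each neuron moves at an O(1) rate.
        a -= lr * m * grad_a
        W -= lr * m * grad_W

    return np.array(train_err), np.array(test_err)
```

Calling `train_err, test_err = train_two_layer()` and plotting both arrays against the step index gives a simple way to look for the point at which the test error stops tracking the train error. Whether a non-monotone test curve is visible depends on the assumed sizes, noise level, and training time.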
