Dynamics of neural scaling laws in random feature regression with powerlaw-distributed kernel eigenvalues

Jakob Kramp, Javed Lindner, Moritz Helias

TL;DR

A dynamical mean-field theory of Neural Network Gaussian process regression quantitatively explains the dynamics of the generalization error by linking spectral and dynamical properties of learning on data with power-law spectra, including phenomena such as neural scaling laws and the effect of early stopping.

Abstract

Training large neural networks exposes neural scaling laws for the generalization error, which points to a universal behavior of learning in high dimensions across network architectures. It was also shown that this effect persists in the limit of highly overparametrized networks as well as in the Neural Network Gaussian process limit. Here we develop a principled understanding of the typical behavior of generalization in Neural Network Gaussian process regression dynamics. We derive a dynamical mean-field theory that captures the typical-case learning dynamics. This allows us to unify multiple regimes of learning studied in the current literature, namely Bayesian inference on Gaussian processes, gradient flow with or without weight decay, and stochastic Langevin training dynamics. Employing tools from statistical physics, the unified framework yields, in each of these cases, an effective description of the high-dimensional microscopic behavior of the network dynamics in terms of lower-dimensional order parameters. We show that the collective training dynamics may be separated into the dynamics of N independent eigenmodes, whose evolution equations are coupled only through collective response functions and the common statistics of an effective, independent noise. Our approach allows us to quantitatively explain the dynamics of the generalization error by linking spectral and dynamical properties of learning on data with power-law spectra, including phenomena such as neural scaling laws and the effect of early stopping.

Paper Structure (20 sections, 76 equations, 4 figures)

Figures (4)

  • Figure 1: Time evolution of the normalized mean discrepancy $v_{i}(t)/\bar{w}_{i}=(\bar{w}_{i}-w_{i}(t))/\bar{w}_{i}$. Simulation (full curves) compared to theory (dashed curves, \ref{eq:effective_eom-1}). Different curves show different modes $\eta_{i}$, from blue (large $\eta_{i}$) to black (small $\eta_{i}$, see legend).
  • Figure 2: Test error for different strengths $2\beta^{-1}$ of the dynamic noise, with $\beta=10000$ (red), $\beta=50$ (green) and $\beta=10$ (blue). The solid curves show the theory \ref{eq:test_error}, the dashed lines the simulation. The other parameters are $g\beta=10^{3}$, $P=N=100$, $\Lambda_{ij}=i^{-3/2}\delta_{ij}$. The time step used for the simulation is $\mathrm{d}t=10^{-4}$; for the theory it is $\mathrm{d}t=10^{-2}$. The disorder average in the simulation is taken over $10^{5}$ different realizations of training data.
  • Figure 3: Bias-variance decomposition of the test error. The blue curves show the full test error \ref{eq:test_error} (full curve: simulation; dashed curve: theory), the green curves show the bias contribution to the test loss, and the red curves show the variance part (cf. \ref{eq:bias_var_decomp}). The expectation value of the kernel follows the power law $\Lambda_{ij}=i^{-3/2}\delta_{ij}$. The time step used for the simulation is $\mathrm{d}t=10^{-4}$; for the theory it is $\mathrm{d}t=10^{-2}$. The disorder average in the simulation is taken over $10^{5}$ different realizations of training data sets. Other parameters: $\beta=10$ and $g\beta=10^{3}$, with the system at the interpolation threshold ($P=N=10^{2}$).
  • Figure 4: Effect of regularization on early stopping. Left panel: Test error for different strengths $g\beta$ of the dynamic noise. The dashed curves show the theory, the solid curves the simulation. Right panel: Individual test error curves with their respective bias-variance decomposition. Dashed curves are theory, solid curves simulation. Other parameters for all panels: $\beta=10^{0}$, $P=N=100$, $\Lambda_{ij}=i^{-3/2}\delta_{ij}$. The time step used for the simulation is $\mathrm{d}t=10^{-4}$; for the theory it is $\mathrm{d}t=10^{-2}$. The disorder average in the simulation is taken over $10^{5}$ different realizations of training data sets.
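
The simulation setup named in the captions of Figures 2 to 4 can be illustrated with a minimal sketch. It is not the authors' code, and it covers only a single realization of the training data rather than the disorder average over $10^{5}$ realizations. Taken from the captions are the power-law spectrum $\Lambda_{ij}=i^{-3/2}\delta_{ij}$, the sizes $P=N=100$, and a dynamic noise of strength $2\beta^{-1}$. Everything else is an assumption made for illustration: the linear teacher `w_teacher`, the squared loss, the hypothetical weight-decay parameter `ridge` (its relation to the paper's $g$ is not specified here), the Euler-Maruyama discretization, and the default time step.

```python
import numpy as np

def simulate_test_error(N=100, P=100, alpha=1.5, beta=10.0,
                        ridge=1e-3, dt=1e-2, steps=2000, seed=0):
    """Discretized Langevin training of a linear model on features
    with a power-law covariance spectrum (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)

    # Power-law spectrum Lambda_ii = i^(-alpha); alpha = 3/2 as in the captions.
    lam = np.arange(1, N + 1, dtype=float) ** (-alpha)

    # Assumption: Gaussian training features with covariance diag(lam).
    X = rng.standard_normal((P, N)) * np.sqrt(lam)

    # Assumption: noiseless linear teacher generating the labels.
    w_teacher = rng.standard_normal(N)
    y = X @ w_teacher

    w = np.zeros(N)
    errors = np.empty(steps)
    for t in range(steps):
        # Population test error for Gaussian inputs:
        # 0.5 * E[(x . (w_teacher - w))^2] = 0.5 * sum_i lam_i * diff_i^2
        diff = w_teacher - w
        errors[t] = 0.5 * np.sum(lam * diff ** 2)

        # Gradient of the assumed loss: mean squared error plus weight decay.
        grad = X.T @ (X @ w - y) / P + ridge * w

        # Euler-Maruyama step with noise variance 2 * dt / beta per weight.
        noise = np.sqrt(2.0 * dt / beta) * rng.standard_normal(N)
        w = w - dt * grad + noise
    return errors

if __name__ == "__main__":
    for beta in (1e4, 50.0, 10.0):
        errs = simulate_test_error(beta=beta)
        print(f"beta={beta:g}: initial={errs[0]:.3f}, "
              f"minimum={errs.min():.3f}, final={errs[-1]:.3f}")
```

Passing `beta=np.inf` switches the noise off and reduces the update to plain gradient descent with weight decay. Comparing the minimum of each curve with its final value gives a rough, single-realization impression of when stopping early would help under these assumptions.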