How Feature Learning Can Improve Neural Scaling Laws

Blake Bordelon, Alexander Atanasov, Cengiz Pehlevan

TL;DR

The paper develops a solvable theory of neural scaling laws that incorporates feature learning beyond the kernel regime, using a two-layer linear network with evolving features. By characterizing the data through power-law spectra with a capacity exponent α and a source exponent β, it identifies hard, easy, and super-easy regimes and shows that feature learning can nearly double the scaling exponent for hard tasks while leaving the exponents for easy and super-easy tasks unchanged. The authors derive explicit DMFT equations, obtain scaling predictions, and validate them with experiments on Sobolev tasks and vision datasets, highlighting when feature learning improves efficiency under fixed compute. The work provides a framework for allocating compute between model size and training time in the rich regime, and discusses limitations and directions for extending these results to nonlinear and adaptive settings.

Abstract

We develop a solvable model of neural scaling laws beyond the kernel limit. Theoretical analysis of this model shows how performance scales with model size, training time, and the total amount of available data. We identify three scaling regimes corresponding to varying task difficulties: hard, easy, and super-easy tasks. For easy and super-easy target functions, which lie in the reproducing kernel Hilbert space (RKHS) defined by the initial infinite-width Neural Tangent Kernel (NTK), the scaling exponents remain unchanged between feature learning and kernel regime models. For hard tasks, defined as those outside the RKHS of the initial NTK, we demonstrate both analytically and empirically that feature learning can improve scaling with training time and compute, nearly doubling the exponent for hard tasks. This leads to a different compute-optimal strategy to scale parameters and training time in the feature learning regime. We support our finding that feature learning improves the scaling law for hard tasks but not for easy and super-easy tasks with experiments on nonlinear MLPs fitting functions with power-law Fourier spectra on the circle and CNNs learning vision tasks.
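
The "nearly doubling" claim can be made concrete with the time exponent quoted in the paper's figure captions, $\mathcal{L}(t) \sim t^{-\beta \max\{1, 2/(1+\beta)\}}$ in the $N, B \to \infty$ limit. The following is a minimal sketch, not the authors' code. The function name `time_exponent` and its `feature_learning` flag are hypothetical, and it assumes the kernel-regime (lazy) exponent is $\beta$ for every $\beta$, which is consistent with the stated doubling but is an assumption here.

```python
def time_exponent(beta: float, feature_learning: bool = True) -> float:
    """Exponent chi in L(t) ~ t^(-chi) for source exponent beta.

    Rich regime: chi = beta * max(1, 2 / (1 + beta)), as quoted in the
    figure captions. Lazy regime: chi = beta (assumption, see lead-in).
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if not feature_learning:
        return beta
    return beta * max(1.0, 2.0 / (1.0 + beta))


def improvement_ratio(beta: float) -> float:
    """Ratio of the rich-regime exponent to the lazy-regime exponent."""
    return time_exponent(beta, True) / time_exponent(beta, False)


if __name__ == "__main__":
    # Hard tasks (beta < 1) gain a factor 2 / (1 + beta); easy tasks gain none.
    for beta in (0.05, 0.25, 0.5, 0.9, 1.0, 1.5):
        print(f"beta={beta:4.2f}  lazy={time_exponent(beta, False):.3f}  "
              f"rich={time_exponent(beta, True):.3f}  "
              f"ratio={improvement_ratio(beta):.3f}")
```

Under these assumptions the ratio is $2/(1+\beta)$ for $\beta < 1$, so it approaches 2 only as $\beta \to 0$ and falls to 1 at $\beta = 1$, which is why the gain is "nearly" a doubling and vanishes for easy tasks.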

Paper Structure (55 sections, 77 equations, 13 figures, 1 table)

Figures (13)

  • Figure 1: Our model changes its scaling law exponents for hard tasks, where the source exponent is sufficiently small ($\beta < 1$). (a) The exponent $\chi(\beta)$ that appears in the loss scaling $\mathcal{L}(t) \sim t^{-\chi(\beta)}$ of our model. (b)-(c) Phase plots in the $(\alpha, \beta)$ plane of the observed scalings that give rise to the compute-optimal trade-off. Arrows $(\to)$ represent a transition from one scaling behavior to another as $t \to \infty$; balancing these terms at fixed compute $C = N t$ gives the compute-optimal scaling law. In the lazy limit $\gamma \to 0$, we recover the phase plot for $\alpha > 1$ of Paquette et al. (2024). At nonzero $\gamma$, however, the set of "hard tasks", given by $\beta < 1$, exhibits an improved scaling exponent. The compute-optimal curves for the easy tasks with $\beta > 1$ are unchanged.
  • Figure 2: The learning dynamics of our model under power-law features exhibit power-law scaling with an exponent that depends on task difficulty. Dashed black lines represent solutions to the dynamical mean field theory (DMFT), while colored lines and shaded regions represent means and error bars over $32$ random experiments. (a) For easy tasks with source exponent $\beta > 1$, the loss is improved with feature learning but the exponent of the power law is unchanged. We plot the approximation $\mathcal{L} \sim t^{-\beta}$ in blue. (b) For hard tasks where $\beta < 1$, the power-law scaling exponent improves. An approximation of our learning curves predicts a new exponent $\mathcal{L} \sim t^{-\frac{2 \beta}{1+\beta}}$, which matches the exact $N,B \to \infty$ equations. (c)-(d) The mean field theory accurately captures the finite-$N$ effects in both the easy and hard task regimes. As $N \to \infty$ the curve approaches $t^{- \beta \max\{1,\frac{2}{1+\beta}\} }$.
  • Figure 3: SGD transients in the feature learning regime. (a) In the hard regime, the SGD noise does not significantly alter the scaling behavior, but does add some additional variance to the predictor. As $B \to \infty$, the loss converges to the $t^{-2\beta/(1+\beta)}$ scaling. (b) In the super-easy regime, the model transitions from the gradient flow scaling $t^{-\beta}$ to an SGD-noise-limited scaling $\frac{1}{B} t^{-2 + 1/\alpha}$ and finally to a finite-$N$ transient scaling $\frac{1}{N} t^{-1+1/\alpha}$.
  • Figure 4: Compute-optimal scalings in the feature learning regime ($\gamma = 0.75$). Dashed black lines are the full DMFT predictions. (a) In the $\beta < 1$ regime, the compute-optimal scaling law is determined by a trade-off between the bottleneck scalings for training time $t$ and model size $N$, giving $\mathcal{L}_\star(C) \sim C^{-\frac{\alpha\beta \chi}{\alpha \beta + \chi }}$, where $\chi = \frac{2\beta}{1+\beta}$ is the time exponent for hard tasks in the rich regime. (b) In the easy task regime $1 < \beta < 2 - \frac{1}{\alpha}$, the large-$C$ scaling is determined by a competition between the bottleneck scaling in time $t$ and the leading-order $1/N$ correction to the dynamics, giving $\mathcal{L}_\star(C) \sim C^{- \frac{\alpha \beta}{\alpha \beta + 1}}$. (c) In the super-easy regime, the scaling exponent at large compute is derived by balancing the SGD noise effects with the $1/N$ transients.
  • Figure 5: Changing the target function's Fourier spectrum or the neural network can change the scaling law in nonlinear networks trained online. These MLPs are depth $4$ and width $512$. (a) Our predicted exponents are compared to SGD training in a ReLU network. The exponent $\beta$ is varied by changing $q$, the decay rate for the target function's Fourier spectrum. The scaling laws are well predicted by our toy model $t^{- \beta \max\{ 1, \frac{2}{1+\beta} \}}$. (b) The learning exponent for a fixed target function can also be manipulated by changing properties of the model such as the activation function $q_\phi$.
  • ...and 8 more figures
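
The compute-optimal exponents in the Figure 4 caption follow from balancing two power laws at fixed compute $C = N t$. The sketch below is illustrative only and is not the authors' code. It assumes a two-term loss model $\mathcal{L}(N, t) \approx t^{-\chi} + N^{-\alpha\beta}$ for the hard regime (the $N^{-\alpha\beta}$ model-size bottleneck is inferred from the quoted exponent, not stated in the captions), and it takes the lazy time exponent to be $\chi = \beta$. All function names are hypothetical. The super-easy regime is left out because the captions do not give its exponent.

```python
import numpy as np


def compute_optimal_exponent(alpha: float, beta: float,
                             feature_learning: bool = True) -> float:
    """Closed-form exponent r in L*(C) ~ C^(-r), hard and easy regimes only."""
    if beta < 1:  # hard: balance t^(-chi) against N^(-alpha * beta)
        chi = 2 * beta / (1 + beta) if feature_learning else beta
        return alpha * beta * chi / (alpha * beta + chi)
    if beta < 2 - 1 / alpha:  # easy: formula quoted in the Figure 4(b) caption
        return alpha * beta / (alpha * beta + 1)
    raise ValueError("super-easy regime is not covered by this sketch")


def fitted_hard_exponent(alpha: float, beta: float, chi: float) -> float:
    """Brute-force check: minimise the assumed loss over N at fixed C = N t."""
    a = alpha * beta
    log_C = np.linspace(4, 12, 30) * np.log(10)        # C from 1e4 to 1e12
    log_N = np.linspace(0, 12 * np.log(10), 20001)     # candidate model sizes
    log_best = []
    for lc in log_C:
        log_t = lc - log_N                             # t = C / N
        loss = np.exp(-chi * log_t) + np.exp(-a * log_N)
        log_best.append(np.log(loss.min()))
    slope = np.polyfit(log_C, log_best, 1)[0]          # d log L* / d log C
    return -slope


if __name__ == "__main__":
    alpha, beta = 2.0, 0.5
    chi_rich = 2 * beta / (1 + beta)
    print("closed form (rich):", compute_optimal_exponent(alpha, beta, True))
    print("grid search (rich):", fitted_hard_exponent(alpha, beta, chi_rich))
    print("closed form (lazy):", compute_optimal_exponent(alpha, beta, False))
    print("grid search (lazy):", fitted_hard_exponent(alpha, beta, beta))
```

Setting $t^{-\chi} = N^{-\alpha\beta}$ with $t = C/N$ gives $N \propto C^{\chi/(\alpha\beta+\chi)}$, so the grid search should reproduce the closed form up to discretization error. The same balancing argument applied to $t^{-\beta}$ and the $\frac{1}{N} t^{-1+1/\alpha}$ transient from the Figure 3 caption yields the easy-regime exponent $\frac{\alpha\beta}{\alpha\beta+1}$.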