How Feature Learning Can Improve Neural Scaling Laws
Blake Bordelon, Alexander Atanasov, Cengiz Pehlevan
TL;DR
The paper develops a solvable theory of neural scaling laws that goes beyond the kernel regime, using a two-layer linear network whose features evolve during training. Data are characterized by power-law spectra with capacity α and source β, which separate tasks into hard, easy, and super-easy regimes. Feature learning can nearly double the scaling exponent for hard tasks while leaving the exponents for easy and super-easy tasks unchanged. The authors derive explicit dynamical mean-field theory (DMFT) equations, obtain scaling predictions, and validate them with experiments on Sobolev tasks and vision datasets, showing when feature learning improves efficiency under a fixed compute budget. The work provides a framework for allocating compute between model size and training time in the rich regime, and discusses limitations and directions for extending the results to nonlinear and adaptive settings.
Abstract
We develop a solvable model of neural scaling laws beyond the kernel limit. Theoretical analysis of this model shows how performance scales with model size, training time, and the total amount of available data. We identify three scaling regimes corresponding to varying task difficulties: hard, easy, and super-easy tasks. For easy and super-easy target functions, which lie in the reproducing kernel Hilbert space (RKHS) defined by the initial infinite-width Neural Tangent Kernel (NTK), the scaling exponents remain unchanged between feature learning and kernel regime models. For hard tasks, defined as those outside the RKHS of the initial NTK, we demonstrate both analytically and empirically that feature learning can improve scaling with training time and compute, nearly doubling the exponent. This leads to a different compute-optimal strategy for scaling parameters and training time in the feature learning regime. We support our finding that feature learning improves the scaling law for hard tasks, but not for easy and super-easy tasks, with experiments on nonlinear MLPs fitting functions with power-law Fourier spectra on the circle and on CNNs learning vision tasks.
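As a reading aid for the phrase "scaling exponent", the sketch below shows one common way to estimate such an exponent from a loss curve: a least-squares line fit in log-log coordinates. It is an illustration under stated assumptions, not the authors' procedure. The function name `fit_power_law_exponent` and the synthetic curves (prefactor 3.0, exponents 0.4 and 0.7) are hypothetical and are not values from the paper.

```python
import math

def fit_power_law_exponent(times, losses):
    """Fit log(loss) = c - gamma * log(t) by least squares and return gamma."""
    xs = [math.log(t) for t in times]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope of the regression line is cov(x, y) / var(x).
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return -cov / var

# Synthetic, noise-free power-law curves (illustrative values only,
# not results reported in the paper).
times = [2 ** k for k in range(4, 14)]
slow = [3.0 * t ** -0.4 for t in times]
fast = [3.0 * t ** -0.7 for t in times]

gamma_slow = fit_power_law_exponent(times, slow)
gamma_fast = fit_power_law_exponent(times, fast)
print(gamma_slow, gamma_fast)
```

On measured curves, the fit would normally be restricted to the range of training times where the loss actually follows a power law, since early transients and late plateaus (for example, from finite model size or finite data) bias the estimated slope.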
