Scaling Laws in Linear Regression: Compute, Parameters, and Data
Licong Lin, Jingfeng Wu, Sham M. Kakade, Peter L. Bartlett, Jason D. Lee
TL;DR
The paper gives a theoretical framework that reconciles neural scaling laws with statistical learning theory by analyzing an infinite-dimensional linear regression model trained with one-pass SGD on sketched covariates. Under a power-law spectrum of degree $a>1$ and a Gaussian prior on the optimal parameter, the excess risk decomposes into an approximation term that decays as $M^{-(a-1)}$ and a bias term that decays as $N_{\mathrm{eff}}^{-(a-1)/a}$, where $N_{\mathrm{eff}}$ is an effective sample size. The variance term is suppressed by the implicit regularization of SGD and therefore does not dominate the bound. The authors derive matching upper and lower bounds for finite model and data sizes, extend the results to source and logarithmic power-law settings, and run experiments in which the empirical exponents align with the predicted values. They also discuss the optimal allocation of data and model size under a compute constraint and connect the implicit bias of SGD to the observed scaling laws, which links empirical scaling laws to classical statistical learning theory and has implications for compute-efficient training.
Abstract
Empirically, large-scale deep learning models often satisfy a neural scaling law: the test error of the trained model improves polynomially as the model size and data size grow. However, conventional wisdom suggests the test error consists of approximation, bias, and variance errors, where the variance error increases with model size. This disagrees with the general form of neural scaling laws, which predict that increasing model size monotonically improves performance. We study the theory of scaling laws in an infinite-dimensional linear regression setup. Specifically, we consider a model with $M$ parameters as a linear function of sketched covariates. The model is trained by one-pass stochastic gradient descent (SGD) using $N$ data points. Assuming the optimal parameter satisfies a Gaussian prior and the data covariance matrix has a power-law spectrum of degree $a>1$, we show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$. The variance error, which increases with $M$, is dominated by the other errors due to the implicit regularization of SGD, and thus disappears from the bound. Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
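The setup described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' experimental code. Several choices are assumptions made only for illustration: the infinite-dimensional problem is truncated to a finite dimension `d`; the sketch is a Gaussian matrix with variance-$1/M$ entries; the step size halves after every block of roughly $N/\log_2 N$ steps; and `sigma`, `gamma0`, and the function names `simulate_excess_risk` and `predicted_rate` are hypothetical.

```python
import numpy as np


def simulate_excess_risk(M, N, a=2.0, d=256, sigma=0.1, gamma0=0.1, seed=0):
    """Run one-pass SGD on sketched covariates once.

    Returns (initial, final) excess risk, i.e. test error minus the noise
    level sigma^2, for the zero initialization and for the last SGD iterate.
    """
    rng = np.random.default_rng(seed)

    # Power-law spectrum lambda_i = i^{-a}; truncating at d is an assumption
    # standing in for the infinite-dimensional setting.
    lam = np.arange(1, d + 1, dtype=float) ** (-a)

    w_star = rng.standard_normal(d)                 # Gaussian prior N(0, I_d)
    S = rng.standard_normal((M, d)) / np.sqrt(M)    # assumed Gaussian sketch
    X = rng.standard_normal((N, d)) * np.sqrt(lam)  # x ~ N(0, diag(lam))
    y = X @ w_star + sigma * rng.standard_normal(N)
    Z = X @ S.T                                     # sketched covariates, shape (N, M)

    def excess_risk(v):
        # E[(<v, Sx> - y)^2] - sigma^2 = (S^T v - w*)^T H (S^T v - w*)
        err = S.T @ v - w_star
        return float(np.sum(lam * err**2))

    v = np.zeros(M)
    initial = excess_risk(v)

    # Assumed schedule: halve the step size after each block of phase_len steps.
    phase_len = max(1, int(np.ceil(N / np.log2(max(N, 2)))))
    for t in range(N):                              # each sample is used once
        gamma = gamma0 / 2 ** (t // phase_len)
        residual = Z[t] @ v - y[t]
        v -= gamma * residual * Z[t]

    return initial, excess_risk(v)


def predicted_rate(M, N, a):
    """Rate from the abstract, with constants and log factors omitted."""
    return M ** (-(a - 1)) + N ** (-(a - 1) / a)


if __name__ == "__main__":
    for M, N in [(16, 1000), (32, 2000), (64, 4000)]:
        runs = [simulate_excess_risk(M, N, seed=s) for s in range(4)]
        final = np.mean([r[1] for r in runs])
        print(f"M={M:3d} N={N:5d} mean excess risk={final:.4f} "
              f"rate={predicted_rate(M, N, 2.0):.4f}")
```

Sweeping `M` at large `N` (or `N` at large `M`) and fitting the slope of the excess risk on a log-log scale gives a rough check against the exponents $-(a-1)$ and $-(a-1)/a$. At these small sizes the truncation and the step-size schedule can visibly affect the fitted slopes.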
