Scaling Laws in Linear Regression: Compute, Parameters, and Data

Licong Lin; Jingfeng Wu; Sham M. Kakade; Peter L. Bartlett; Jason D. Lee

TL;DR

The paper reconciles neural scaling laws with statistical learning theory by analyzing an infinite-dimensional linear regression model trained with one-pass SGD on sketched covariates. Under a power-law spectrum of degree $a>1$ and a Gaussian prior on the optimal parameter, the reducible risk is governed by an approximation term of order $M^{-(a-1)}$ and a bias term of order $N_\mathtt{eff}^{-(a-1)/a}$, where $N_\mathtt{eff} := N/\log(N)$; the variance term is suppressed by the implicit regularization of SGD and does not dominate the bound. The authors derive matching upper and lower bounds for finite model and data sizes, extend the results to source-condition and logarithmic power-law settings, and validate the theory with experiments whose fitted exponents align with the predicted values. They also discuss the optimal allocation of data and model size under a compute constraint and connect the implicit bias of SGD to the observed scaling laws, with practical implications for compute-efficient training.
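
The allocation of data and model size under a compute budget can be worked out in a few lines from the two rates. The sketch below assumes, for illustration only, that compute scales as $C \asymp MN$ (one update of an $M$-dimensional predictor per sample) and ignores the logarithmic gap between $N$ and $N_\mathtt{eff}$; both are simplifying assumptions, not statements taken from the theorems.

```latex
\begin{align*}
\text{reducible risk} &\asymp M^{-(a-1)} + N^{-(a-1)/a},
  \qquad C \asymp M N \quad \text{(assumed cost model)} \\
\text{balance the two terms:}\quad
  M^{-(a-1)} \asymp N^{-(a-1)/a} &\iff M \asymp N^{1/a} \\
\text{substitute into } MN \asymp C:\quad
  N^{1 + 1/a} \asymp C &\implies N \asymp C^{a/(a+1)}, \quad M \asymp C^{1/(a+1)} \\
\text{resulting rate:}\quad
  \text{reducible risk} &\asymp C^{-(a-1)/(a+1)}
\end{align*}
```

For example, $a=2$ gives $M \asymp C^{1/3}$, $N \asymp C^{2/3}$, and a reducible risk of order $C^{-1/3}$ under these assumptions.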

Abstract

Empirically, large-scale deep learning models often satisfy a neural scaling law: the test error of the trained model improves polynomially as the model size and data size grow. However, conventional wisdom suggests the test error consists of approximation, bias, and variance errors, where the variance error increases with model size. This disagrees with the general form of neural scaling laws, which predict that increasing model size monotonically improves performance. We study the theory of scaling laws in an infinite-dimensional linear regression setup. Specifically, we consider a model with $M$ parameters as a linear function of sketched covariates. The model is trained by one-pass stochastic gradient descent (SGD) using $N$ data. Assuming the optimal parameter satisfies a Gaussian prior and the data covariance matrix has a power-law spectrum of degree $a>1$, we show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$. The variance error, which increases with $M$, is dominated by the other errors due to the implicit regularization of SGD, thus disappearing from the bound. Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
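
The setup described in the abstract can be simulated at toy scale. The following is a minimal sketch, not the authors' experimental code: it truncates the infinite-dimensional problem to `d` coordinates, draws sketch entries from $N(0, 1/M)$, starts SGD from zero, and halves the step size every $N/\log_2 N$ steps. The function name `run_trial`, the truncation, the sketch distribution, and the step-size schedule are all assumptions made for illustration. It uses only the standard library so that it runs anywhere; a vectorized version would be much faster.

```python
import math
import random


def run_trial(a=2.0, d=100, M=8, N=500, gamma=0.1, sigma=1.0, seed=0):
    """One-pass SGD on sketched covariates; returns (initial_risk, final_risk)."""
    rng = random.Random(seed)
    lam = [(i + 1) ** (-a) for i in range(d)]         # power-law spectrum, truncated at d (assumption)
    w_star = [rng.gauss(0.0, 1.0) for _ in range(d)]  # optimal parameter drawn from a Gaussian prior
    # Sketch matrix S (M x d) with N(0, 1/M) entries (assumed distribution).
    S = [[rng.gauss(0.0, 1.0 / math.sqrt(M)) for _ in range(d)] for _ in range(M)]

    def risk(v):
        # Population risk: sigma^2 + sum_i lam_i * ((S^T v)_i - w*_i)^2.
        total = sigma ** 2
        for i in range(d):
            pred_i = sum(S[j][i] * v[j] for j in range(M))
            total += lam[i] * (pred_i - w_star[i]) ** 2
        return total

    v = [0.0] * M                                      # zero initialization (assumption)
    initial_risk = risk(v)
    # Halve the step size every N / log2(N) steps (assumed schedule).
    phase_len = max(1, int(N / math.log2(max(N, 2))))
    for t in range(N):
        step = gamma / 2 ** (t // phase_len)
        x = [math.sqrt(l) * rng.gauss(0.0, 1.0) for l in lam]           # x ~ N(0, diag(lam))
        y = sum(xi * wi for xi, wi in zip(x, w_star)) + sigma * rng.gauss(0.0, 1.0)
        z = [sum(s * xi for s, xi in zip(row, x)) for row in S]         # sketched covariate S x
        resid = sum(vj * zj for vj, zj in zip(v, z)) - y
        v = [vj - step * resid * zj for vj, zj in zip(v, z)]            # one SGD step, each sample used once
    return initial_risk, risk(v)


if __name__ == "__main__":
    init, final = run_trial()
    print(f"risk at initialization: {init:.3f}, risk after SGD: {final:.3f}")
```

Averaging `final_risk` over many seeds for a grid of `M` and `N` would give a rough empirical counterpart of the risk surface that the scaling law describes, though at this scale the truncation to `d` coordinates affects the result.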

Paper Structure (76 sections, 37 theorems, 258 equations, 12 figures)

Introduction
A mystery.
Our explanation.
Empirical evidence.
Notation.
Related work
Empirical scaling laws.
Theory of scaling laws.
Implicit regularization of SGD.
Setup
Risk decomposition.
Scaling laws
Optimal stepsize.
Allocation of data and model sizes.
Comparison with bordelon2024dynamical.
...and 61 more sections

Key Result

Theorem 4.1

Suppose that assump:simple and assump:power-law hold. Consider an $M$-dimensional sketched predictor trained by eq:sgd with $N$ samples. Let $N_\mathtt{eff} := N/\log(N)$ and recall the risk decomposition in eq:approx-excess-decomp. Then there exists some $a$-dependent constant $c>0$ such that when the … In all results, the hidden constants depend only on the power-law degree $a$. As a direct consequence, …

Figures (12)

Figure 1: The expected risk (Risk) of the last iterate of eq:sgd versus the effective sample size $N_\mathtt{eff}$ and the model size $M$ for different power-law degrees $a$. The expected risk is computed by averaging over $1000$ independent samples of $(\mathbf{w}^*,\mathbf{S})$. We fit the expected risk using the formula $\text{Risk}\sim\sigma^2+c_1/M^{a_1}+c_2/N^{a_2}$ via minimizing the Huber loss as in hoffmann2022training. Parameters: $\sigma=1$, $\gamma=0.1$. Left: For $a=1.5$, $d=20000$, the fitted exponents are $(a_1,a_2)=(0.54,0.34)\approx(0.5,0.33)$. Right: For $a=2$, $d=2000$, the fitted exponents are $(a_1,a_2)=(1.07,0.49)\approx(1.0,0.5)$. Note that the values of $(a_1,a_2)$ are close to our theoretical predictions $(a-1,1-1/a)$ in both cases, verifying the sharpness of our risk bounds. More details can be found in sec:scaling_law_examples and sec:exp.
Figure 3: The expected risk (Risk) of the average of iterates of eq:sgd versus the sample size $N$ and the model size $M$ for different power-law degrees $a$. The expected risk is computed by averaging over $1000$ independent samples of $(\mathbf{w}^*,\mathbf{S})$. We fit the expected risk using the formula $\text{Risk}\sim\sigma^2+c_1/M^{a_1}+c_2/N^{a_2}$ via minimizing the Huber loss as in hoffmann2022training. Parameters: $\sigma=1$, $\gamma=0.1$. Left: For $a=1.5$, $d=20000$, the fitted exponents are $(a_1,a_2)=(0.59,0.33)\approx(0.5,0.33)$. Right: For $a=2$, $d=2000$, the fitted exponents are $(a_1,a_2)=(1.09,0.49)\approx(1.0,0.5)$. Note that the values of $(a_1,a_2)$ are close to our theoretical predictions $(a-1,1-1/a)$ in both cases.
Figure : (a) $a=1.5$
...and 7 more figures
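
The comparison made in the Figure 1 caption, fitted exponents against the predicted pair $(a-1, 1-1/a)$, is simple arithmetic and can be checked directly. The snippet below is a small illustrative sketch; the helper name `predicted_exponents` and the dictionary `reported` are hypothetical, and the fitted values are copied from the Figure 1 caption rather than recomputed.

```python
def predicted_exponents(a):
    """Predicted (a_1, a_2) in Risk ~ sigma^2 + c_1 / M^a_1 + c_2 / N^a_2."""
    if a <= 1:
        raise ValueError("the power-law degree must satisfy a > 1")
    return a - 1.0, 1.0 - 1.0 / a


# Fitted exponents (a_1, a_2) as reported in the Figure 1 caption, keyed by a.
reported = {1.5: (0.54, 0.34), 2.0: (1.07, 0.49)}

for a, (fit_m, fit_n) in reported.items():
    pred_m, pred_n = predicted_exponents(a)
    print(f"a={a}: model exponent {fit_m:.2f} vs {pred_m:.2f}, "
          f"data exponent {fit_n:.2f} vs {pred_n:.2f}")
```

Note that $1-1/a$ is the same quantity as the data exponent $(a-1)/a$ that appears in the abstract's rate.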

Theorems & Definitions (62)

Definition 1: Data covariance and optimal parameter
Theorem 4.1: Scaling law
Theorem 4.2: Scaling law under source condition
Theorem 4.3: Scaling law under logarithmic power spectrum
Theorem 6.1: Excess risk decomposition
Lemma 6.2: Power law
Theorem 6.3: A general upper bound
Theorem 6.4: A general lower bound
Lemma A.1: Approximation error
Proof: Proof of lemma:approximation
...and 52 more
