Scaling Laws in Linear Regression: Compute, Parameters, and Data

Licong Lin, Jingfeng Wu, Sham M. Kakade, Peter L. Bartlett, Jason D. Lee

TL;DR

The paper provides a theoretical framework that reconciles neural scaling laws with statistical learning theory by analyzing an infinite-dimensional linear regression model trained with one-pass SGD on sketched covariates. Under a power-law spectrum and a Gaussian prior on the optimal parameter, the excess risk decomposes into approximation and bias terms that decay as $M^{-(a-1)}$ and $N_\mathtt{eff}^{-(a-1)/a}$ (with $N_\mathtt{eff} := N/\log(N)$), while the variance term is dominated by these two terms because of the implicit regularization of SGD and therefore does not appear in the bound. The authors derive matching upper and lower bounds for finite model and data sizes, extend the results to source-condition and logarithmic power-law settings, and validate the theory with experiments whose fitted exponents are close to the predicted values. They further discuss optimal data-model allocation under compute constraints and connect SGD's implicit bias to the observed scaling laws, with practical implications for compute-efficient training strategies.
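
As a rough worked illustration of the data-model allocation mentioned above, the two decaying terms can be balanced under a fixed compute budget. The budget model $C \asymp MN$ is an assumption made here for illustration (it is not stated in this summary), and constants and logarithmic factors are ignored, so this is a sketch rather than the paper's exact statement.

```latex
% Assumption (hypothetical, for illustration): compute budget C \asymp M N,
% i.e. one-pass SGD costs O(M) per sample. Constants and log factors are dropped.
\begin{align*}
\text{reducible risk} &\asymp M^{-(a-1)} + N^{-(a-1)/a}, \qquad MN \asymp C,\\
M^{-(a-1)} \asymp N^{-(a-1)/a} &\iff M \asymp N^{1/a},\\
C \asymp N^{1/a}\cdot N = N^{(a+1)/a} &\implies N \asymp C^{a/(a+1)},\quad M \asymp C^{1/(a+1)},\\
\text{reducible risk} &\asymp C^{-(a-1)/(a+1)}.
\end{align*}
```

For example, with $a=2$ this balance gives $M \asymp C^{1/3}$, $N \asymp C^{2/3}$, and a reducible risk of order $C^{-1/3}$.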

Abstract

Empirically, large-scale deep learning models often satisfy a neural scaling law: the test error of the trained model improves polynomially as the model size and data size grow. However, conventional wisdom suggests the test error consists of approximation, bias, and variance errors, where the variance error increases with model size. This disagrees with the general form of neural scaling laws, which predict that increasing model size monotonically improves performance. We study the theory of scaling laws in an infinite dimensional linear regression setup. Specifically, we consider a model with $M$ parameters as a linear function of sketched covariates. The model is trained by one-pass stochastic gradient descent (SGD) using $N$ data. Assuming the optimal parameter satisfies a Gaussian prior and the data covariance matrix has a power-law spectrum of degree $a>1$, we show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$. The variance error, which increases with $M$, is dominated by the other errors due to the implicit regularization of SGD, thus disappearing from the bound. Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
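
The setup described in the abstract can be sketched numerically. The following is a minimal simulation under stated assumptions, not the authors' code: the ambient dimension is truncated to a finite `d`, the sketch `S` has i.i.d. Gaussian entries with variance $1/M$, the prior covariance of the optimal parameter is the identity, and the step size is halved every $N/\log_2 N$ steps. The function name `run_trial` and all default values are illustrative choices.

```python
import numpy as np


def run_trial(a=2.0, d=500, M=32, N=4000, gamma=0.1, sigma=1.0, seed=0):
    """Draw (w*, S) once, run one-pass SGD, return (initial, final) excess risk."""
    rng = np.random.default_rng(seed)
    lam = np.arange(1, d + 1, dtype=float) ** (-a)  # power-law spectrum, truncated at d
    w_star = rng.standard_normal(d)                 # Gaussian prior (assumed identity covariance)
    S = rng.standard_normal((M, d)) / np.sqrt(M)    # Gaussian sketch (assumed entries N(0, 1/M))

    # Fresh samples x ~ N(0, diag(lam)), y = <x, w*> + noise; each sample is used once.
    X = rng.standard_normal((N, d)) * np.sqrt(lam)
    y = X @ w_star + sigma * rng.standard_normal(N)
    Z = X @ S.T                                     # sketched covariates, shape (N, M)

    # Assumed schedule: halve the step size every N / log2(N) steps.
    phase_len = max(1, int(N / np.log2(N)))
    v = np.zeros(M)
    for t in range(N):
        step = gamma / 2 ** (t // phase_len)
        v -= step * (Z[t] @ v - y[t]) * Z[t]        # SGD step on the squared loss

    def excess_risk(vec):
        # E[(<Sx, vec> - y)^2] - sigma^2 = sum_i lam_i * (S^T vec - w*)_i^2
        err = S.T @ vec - w_star
        return float(np.sum(lam * err ** 2))

    return excess_risk(np.zeros(M)), excess_risk(v)


if __name__ == "__main__":
    for M in (8, 32, 128):
        finals = [run_trial(M=M, seed=s)[1] for s in range(5)]
        print(f"M={M:4d}  mean excess risk of last iterate: {np.mean(finals):.4f}")
```

Averaging over more draws of `(w_star, S)` and sweeping both `M` and `N` would be needed before comparing against the predicted exponents; the small sweep above only illustrates the mechanics.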

Paper Structure (76 sections, 37 theorems, 258 equations, 12 figures)

Key Result

Theorem 4.1

Suppose that Assumptions assump:simple and assump:power-law hold. Consider an $M$-dimensional sketched predictor trained by SGD (eq:sgd) with $N$ samples. Let $N_\mathtt{eff} := N/\log(N)$ and recall the risk decomposition in eq:approx-excess-decomp. Then there exists some $a$-dependent constant $c>0$ such that when the ... In all results, the hidden constants depend only on the power-law degree $a$. As a direct consequen...

Figures (12)

  • Figure 1: The expected risk (Risk) of the last iterate of SGD (eq:sgd) versus the effective sample size $N_\mathtt{eff}$ and the model size $M$ for different power-law degrees $a$. The expected risk is computed by averaging over $1000$ independent samples of $(\mathbf{w}^*,\mathbf{S})$. We fit the expected risk using the formula $\text{Risk}\sim\sigma^2+c_1/M^{a_1}+c_2/N^{a_2}$ by minimizing the Huber loss as in Hoffmann et al. (2022). Parameters: $\sigma=1$, $\gamma=0.1$. Left: For $a=1.5$, $d=20000$, the fitted exponents are $(a_1,a_2)=(0.54,0.34)\approx(0.5,0.33)$. Right: For $a=2$, $d=2000$, the fitted exponents are $(a_1,a_2)=(1.07,0.49)\approx(1.0,0.5)$. Note that the values of $(a_1,a_2)$ are close to our theoretical predictions $(a-1,1-1/a)$ in both cases, verifying the sharpness of our risk bounds. More details can be found in Sections sec:scaling_law_examples and sec:exp.
  • Figure 3: The expected risk (Risk) of the average of iterates of SGD (eq:sgd) versus the sample size $N$ and the model size $M$ for different power-law degrees $a$. The expected risk is computed by averaging over $1000$ independent samples of $(\mathbf{w}^*,\mathbf{S})$. We fit the expected risk using the formula $\text{Risk}\sim\sigma^2+c_1/M^{a_1}+c_2/N^{a_2}$ by minimizing the Huber loss as in Hoffmann et al. (2022). Parameters: $\sigma=1$, $\gamma=0.1$. Left: For $a=1.5$, $d=20000$, the fitted exponents are $(a_1,a_2)=(0.59,0.33)\approx(0.5,0.33)$. Right: For $a=2$, $d=2000$, the fitted exponents are $(a_1,a_2)=(1.09,0.49)\approx(1.0,0.5)$. Note that the values of $(a_1,a_2)$ are close to our theoretical predictions $(a-1,1-1/a)$ in both cases.
  • Figure : (a) $a=1.5$
  • Figure : (a) $a=1.5$
  • Figure : (a) $a=1.5$
  • ...and 7 more figures
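
The comparison made in the Figure 1 caption can be reproduced as simple arithmetic. The short sketch below computes the predicted exponents $(a-1,\,1-1/a)$ and their gaps to the fitted last-iterate values quoted in that caption; the helper name `predicted_exponents` and the `reported` dictionary are illustrative, and the numbers in `reported` are copied from the caption rather than recomputed.

```python
def predicted_exponents(a):
    """Predicted (model-size, data-size) exponents (a - 1, 1 - 1/a) for degree a > 1."""
    if a <= 1:
        raise ValueError("the power-law degree must satisfy a > 1")
    return a - 1.0, 1.0 - 1.0 / a


# Fitted last-iterate exponents (a_1, a_2) as quoted in the Figure 1 caption.
reported = {1.5: (0.54, 0.34), 2.0: (1.07, 0.49)}

gaps = {}
for a, (fit_m, fit_n) in reported.items():
    pred_m, pred_n = predicted_exponents(a)
    # Absolute gap between fitted and predicted exponent, per axis.
    gaps[a] = (abs(fit_m - pred_m), abs(fit_n - pred_n))
    print(f"a={a}: predicted ({pred_m:.2f}, {pred_n:.2f}), "
          f"fitted ({fit_m}, {fit_n}), gaps ({gaps[a][0]:.2f}, {gaps[a][1]:.2f})")
```

In both cases the gaps stay below 0.1, which matches the caption's statement that the fitted values are close to the predictions.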

Theorems & Definitions (62)

  • Definition 1: Data covariance and optimal parameter
  • Theorem 4.1: Scaling law
  • Theorem 4.2: Scaling law under source condition
  • Theorem 4.3: Scaling law under logarithmic power spectrum
  • Theorem 6.1: Excess risk decomposition
  • Lemma 6.2: Power law
  • Theorem 6.3: A general upper bound
  • Theorem 6.4: A general lower bound
  • Lemma A.1: Approximation error
  • Proof: Proof of Lemma lemma:approximation
  • ...and 52 more