Scaling and renormalization in high-dimensional regression

Alexander Atanasov; Jacob A. Zavatone-Veth; Cengiz Pehlevan

TL;DR

This work presents a unifying framework, based on random matrix theory and free probability, for deriving precise training and generalization asymptotics for high-dimensional ridge regression, kernel methods, and linear and random feature models. Central to the approach are deterministic equivalence and the $S$-transform: statistical fluctuations in empirical covariance matrices are absorbed into a renormalized ridge parameter, which yields closed-form expressions and sharp scaling laws. The authors systematically map out how different data and feature covariances, including isotropic, structured, and additive feature-noise scenarios, give rise to distinct regimes such as double descent, variance-dominated scaling, and ridge-dominated transitions, and they connect these phenomena to neural scaling laws. The framework also provides practical tools for estimating out-of-sample risk from training data alone (GCV/KARE) and extends to deep linear and random feature architectures, offering a coherent view of when and why overparameterized systems generalize well. Overall, the paper links the renormalization induced by random covariances to observable learning curves across a broad class of high-dimensional regression models, with implications for understanding and predicting scaling in neural networks.

Abstract

From benign overfitting in overparameterized models to rich power-law scalings in performance, simple ridge regression displays surprising behaviors sometimes thought to be limited to deep neural networks. This balance of phenomenological richness with analytical tractability makes ridge regression the model system of choice in high-dimensional machine learning. In this paper, we present a unifying perspective on recent results on ridge regression using the basic tools of random matrix theory and free probability, aimed at readers with backgrounds in physics and deep learning. We highlight the fact that statistical fluctuations in empirical covariance matrices can be absorbed into a renormalization of the ridge parameter. This `deterministic equivalence' allows us to obtain analytic formulas for the training and generalization errors in a few lines of algebra by leveraging the properties of the $S$-transform of free probability. From these precise asymptotics, we can easily identify sources of power-law scaling in model performance. In all models, the $S$-transform corresponds to the train-test generalization gap, and yields an analogue of the generalized-cross-validation estimator. Using these techniques, we derive fine-grained bias-variance decompositions for a very general class of random feature models with structured covariates. This allows us to discover a scaling regime for random feature models where the variance due to the features limits performance in the overparameterized setting. We also demonstrate how anisotropic weight structure in random feature models can limit performance and lead to nontrivial exponents for finite-width corrections in the overparameterized setting. Our results extend and provide a unifying perspective on earlier models of neural scaling laws.
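To make the renormalization of the ridge parameter concrete, here is a minimal numerical sketch for the simplest case of isotropic Gaussian covariates. It is an illustration under stated assumptions, not the authors' code: the normalization $\hat{\bm \Sigma} = \bm X^\top \bm X / P$, the shorthand `q = N / P`, the white-Wishart $S$-transform $S(t) = 1/(1 + q t)$, and the helper names `kappa_isotropic` and `df1_empirical` are all choices made for this example. With $\bm \Sigma = \mathbf I$ the self-consistency condition $\kappa = \lambda / (1 - q\, \mathrm{df}_1(\kappa))$, where $\mathrm{df}_1(\kappa) = 1/(1+\kappa)$, reduces to a quadratic in $\kappa$.

```python
import numpy as np

def kappa_isotropic(lam, q):
    """Renormalized ridge for Sigma = I (assumed setup for this sketch).

    Positive root of kappa^2 + (1 - q - lam) * kappa - lam = 0, which follows
    from kappa = lam / (1 - q * df1(kappa)) with df1(kappa) = 1 / (1 + kappa).
    """
    b = 1.0 - q - lam
    return 0.5 * (-b + np.sqrt(b * b + 4.0 * lam))

def df1_empirical(X, lam):
    """(1/N) tr[S (S + lam)^{-1}] for the empirical covariance S = X^T X / P."""
    P, N = X.shape
    eigs = np.linalg.eigvalsh(X.T @ X / P)
    eigs = np.clip(eigs, 0.0, None)  # guard against tiny negative round-off
    return float(np.mean(eigs / (eigs + lam)))

rng = np.random.default_rng(0)
N, P, lam = 200, 400, 1e-2        # illustrative sizes, not taken from the paper
q = N / P
X = rng.standard_normal((P, N))   # rows are isotropic Gaussian samples

kappa = kappa_isotropic(lam, q)
df_theory = 1.0 / (1.0 + kappa)   # df1 of the population covariance at kappa
df_emp = df1_empirical(X, lam)    # df1 of the empirical covariance at lam

# Estimate kappa from training data alone: kappa ~ lam * S(-df_emp).
kappa_from_data = lam / (1.0 - q * df_emp)

print(f"kappa (theory)    = {kappa:.5f}")
print(f"kappa (from data) = {kappa_from_data:.5f}")
print(f"df1 theory / empirical = {df_theory:.5f} / {df_emp:.5f}")
```

In this sketch the empirical degrees of freedom evaluated at the bare ridge $\lambda$ should track the population degrees of freedom evaluated at the renormalized ridge $\kappa$, which is the sense in which covariance fluctuations are absorbed into $\kappa$.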

Paper Structure (76 sections, 284 equations, 17 figures)

Introduction
Review of Neural Scaling Laws
Overview and Contributions
Code Availability
Random Matrix Models of Empirical Covariance Matrices
Motivation: Empirical Covariance Matrices
Examples of Random Matrices
The Spectral Density and the Resolvent
Degrees of Freedom
Addition and Multiplication of Random Matrices
R-transform
S-transform
Subordination Relations and Strong Deterministic Equivalence
Summary of R and S transform identities
Application: Empirical Covariances
...and 61 more sections

Figures (17)

Figure 1: Linear regression on unstructured covariates, i.e. $\bm \Sigma = \mathbf I$. Left: We plot theory (solid lines) for the quantities of interest $\kappa, \gamma, \mathrm{df}_1, \mathrm{df}_2$. We also plot the empirical estimate of $\mathrm{df}_1$, namely $\mathrm{df}_{\hat{\bm \Sigma}}(\lambda)$. Using this, we estimate $\kappa_1$ from the training set and find excellent agreement. Right: We plot the training and generalization errors (blue and black, respectively) as well as the bias (green) and the variances (orange, red) due to the dataset and label noise. Dots and error bars indicate empirical simulations over 20 seeds over the training set. Solid curves show theory. We find excellent agreement for all relevant quantities. The GCV estimator is plotted as orchid triangles, and again we find strong agreement with the generalization error. Here, $\lambda=10^{-3}$.
Figure 2: Double descent without label noise in a linear regression task. Here, $\bm \Sigma$ has an eigenspectrum with eigenvalues $\eta_1, \eta_2, \eta_3$ that have values $1, 10^{-2}, 10^{-5}$ and multiplicities $10, 10^2, 10^4$ respectively. The dashed line indicates when $P \approx N_k$. The teacher $\bar{\bm w}$ has increasing power in higher modes, given by $1, 10, 10^2$ respectively. The fact that the higher modes are not learnable leads to an effective label-noise like effect that causes this multiple descent phenomenon. We stress that the variance $\mathrm{Var}_{\bm X, \bm \epsilon} = 0$ since there is no label noise.
Figure 3: Schematic of the bias-variance decomposition for linear regression. The color scheme matches the plots in Figures \ref{fig:unstructured_LR_linspace}, \ref{fig:multiple_descent} and \ref{fig:structured_LR_ab}. Grey regions do not contribute to variance.
Figure 4: Left: Scaling of various relevant parameters for power-law structured data. The analytic solution for $\kappa$ is plotted (solid black), as well as its GCV estimate from the data given by $S(-\mathrm{df}_{\hat{\bm \Sigma}}(\lambda)) \lambda$ (orchid triangles). The scaling law $P^{-\alpha}$ is also plotted (dashed black), showing excellent agreement. We also plot $\mathrm{df}^1_{\bm \Sigma}(\kappa)$ (solid blue) and its empirical estimate $\mathrm{df}^1_{\hat{\bm \Sigma}}(\lambda)$ (blue circles), finding excellent agreement. We also plot the scaling law $P/N$ (dashed blue). Finally, we plot $\mathrm{df}_2$ and $\gamma = \frac{P}{N} \mathrm{df}_2$ (dashed green and yellow respectively). We see that $\gamma$ is relatively constant across $P$. For faster decays it would be more constant still. Right: The same, with faster spectral decay. We find agreement until $\kappa \sim \lambda$, where we enter the ridge-dominated scaling regime highlighted in Equation \ref{eq:LR_Eg_scaling}.
Figure 5: Generalization error (solid black) for two different teacher decay constants. We see that $\mathrm{min}(1, r)$ determines whether the scaling law is due solely to the capacity or whether the source also plays a role. The bias (solid green) and the variance over the dataset (solid orange) follow identical scaling laws. The results of empirical simulations are plotted as solid dots, showing excellent agreement. The GCV estimate from the training error is given by orchid triangles. Here, $N=10000$, and the spectral decay makes the final result insensitive to $N$.
...and 12 more figures
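
Several of the captions above refer to a GCV estimate of the generalization error computed from training data. The following is a rough sketch of how such an estimate can be formed for ridge regression on isotropic Gaussian covariates. It is not the authors' simulation code: the problem sizes, the noise level `sigma_eps`, the teacher `w_bar`, the ridge objective $\frac{1}{P}\|\bm X \bm w - \bm y\|^2 + \lambda \|\bm w\|^2$, and the choice to compare against the out-of-sample error on noisy labels are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 400, 1200                  # illustrative sizes, not taken from the paper
lam, sigma_eps = 1e-1, 0.5        # assumed ridge and label-noise level

# Hypothetical teacher with O(1) norm and noisy linear labels.
w_bar = rng.standard_normal(N) / np.sqrt(N)
X = rng.standard_normal((P, N))
y = X @ w_bar + sigma_eps * rng.standard_normal(P)

# Ridge solution of (1/P) ||X w - y||^2 + lam ||w||^2.
S = X.T @ X / P                   # empirical covariance
A = S + lam * np.eye(N)
w_hat = np.linalg.solve(A, X.T @ y / P)

# Training error and empirical degrees of freedom df = (1/N) tr[S (S + lam)^{-1}].
E_train = float(np.mean((X @ w_hat - y) ** 2))
df_emp = float(np.trace(np.linalg.solve(A, S)) / N)

# GCV: rescale the training error by (kappa / lam)^2 = 1 / (1 - (N/P) df)^2.
gcv = E_train / (1.0 - (N / P) * df_emp) ** 2

# For Sigma = I the expected squared error on a fresh noisy sample is exact:
E_out = float(np.sum((w_hat - w_bar) ** 2) + sigma_eps ** 2)

print(f"E_train = {E_train:.4f}, GCV = {gcv:.4f}, E_out = {E_out:.4f}")
```

The rescaling factor is the same quantity $S(-\mathrm{df}_{\hat{\bm \Sigma}}(\lambda))$ that the Figure 4 caption uses to estimate $\kappa$ from data, here squared and applied to the training error. In this sketch the estimate targets the error on noisy test labels; subtracting `sigma_eps ** 2` would give the noise-free excess error instead.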

Theorems & Definitions (4)

Example 1: White Wishart Matrices
Example 2: Structured Wishart Matrices and Multiplicative Noise
Example 3: Wigner Matrices as Additive Noise
Example 4: Random Projection
