Asymptotics of Random Feature Regression Beyond the Linear Scaling Regime

Hong Hu, Yue M. Lu, Theodor Misiakiewicz

TL;DR

This work analyzes random feature ridge regression in the polynomial high-dimensional regime where $p/d^{\kappa_1}\to\theta_1$ and $n/d^{\kappa_2}\to\theta_2$, deriving sharp asymptotics for the test error that reveal a precise interaction between approximation and statistical errors. The authors establish a Gaussian covariate equivalence and express the limiting risk in terms of polynomial degree learnability up to $\ell-1$ (and possibly $\ell$) determined by the scaling, with a fixed-point framework governing the asymptotics. They show a multi-phase learning behavior, including a double-descent peak at $n=p$ under certain conditions, and identify regimes where overparametrization either improves toward KRR performance or leaves the limit at the random-feature approximation error. The results provide a detailed lens on how parameter count, sample size, regularization, and activation nonlinearity jointly shape generalization in modern overparametrized models. Overall, the paper advances understanding of model complexity beyond linear scaling and offers precise guidance on how to choose $p$ relative to $n$ for optimal test error in high-dimensional settings.

Abstract

Recent advances in machine learning have been achieved by using overparametrized models trained until near interpolation of the training data. It was shown, e.g., through the double descent phenomenon, that the number of parameters is a poor proxy for the model complexity and generalization capabilities. This leaves open the question of understanding the impact of parametrization on the performance of these models. How do model complexity and generalization depend on the number of parameters $p$? How should we choose $p$ relative to the sample size $n$ to achieve optimal test error? In this paper, we investigate the example of random feature ridge regression (RFRR). This model can be seen either as a finite-rank approximation to kernel ridge regression (KRR), or as a simplified model for neural networks trained in the so-called lazy regime. We consider covariates uniformly distributed on the $d$-dimensional sphere and compute sharp asymptotics for the RFRR test error in the high-dimensional polynomial scaling, where $p,n,d \to \infty$ while $p / d^{\kappa_1}$ and $n / d^{\kappa_2}$ stay constant, for all $\kappa_1, \kappa_2 \in \mathbb{R}_{>0}$. These asymptotics precisely characterize the impact of the number of random features and regularization parameter on the test performance. In particular, RFRR exhibits an intuitive trade-off between approximation and generalization power. For $n = o(p)$, the sample size $n$ is the bottleneck and RFRR achieves the same performance as KRR (which is equivalent to taking $p = \infty$). On the other hand, if $p = o(n)$, the number of random features $p$ is the limiting factor and RFRR test error matches the approximation error of the random feature model class (akin to taking $n = \infty$). Finally, a double descent appears at $n = p$, a phenomenon that was previously only characterized in the linear scaling $\kappa_1 = \kappa_2 = 1$.
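
To make the RFRR setup in the abstract concrete, the following is a minimal NumPy sketch, not the authors' code. Several choices are assumptions made for illustration only: the sphere radius $\sqrt{d}$, the feature normalization $\sigma(\langle x, w\rangle/\sqrt{d})/\sqrt{p}$, the ReLU activation, the ridge penalty convention, and all function names. The target function borrows the form used in the Figure 5 caption.

```python
import numpy as np

def sample_sphere(rng, m, d):
    """Draw m points uniformly on the sphere of radius sqrt(d) in R^d (assumed normalization)."""
    g = rng.standard_normal((m, d))
    return np.sqrt(d) * g / np.linalg.norm(g, axis=1, keepdims=True)

def random_features(X, W, sigma):
    """Feature matrix with entries sigma(<x_i, w_j> / sqrt(d)) / sqrt(p) (assumed scaling)."""
    d = X.shape[1]
    p = W.shape[0]
    return sigma(X @ W.T / np.sqrt(d)) / np.sqrt(p)

def fit_rfrr(Z, y, lam):
    """Ridge solution a = (Z^T Z + lam I)^{-1} Z^T y on the random features."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)

def rfrr_test_error(d=20, n=200, p=100, lam=1e-2, noise_std=0.5,
                    n_test=2000, seed=0):
    """Monte Carlo estimate of the RFRR test error for one random draw."""
    rng = np.random.default_rng(seed)
    sigma = lambda t: np.maximum(t, 0.0)      # illustrative activation (ReLU)
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)              # unit-norm direction
    # Target of the form beta^T x + (beta^T x)^2, as in the Figure 5 caption.
    target = lambda X: X @ beta + (X @ beta) ** 2

    W = sample_sphere(rng, p, d)              # random first-layer weights, kept fixed
    X_train = sample_sphere(rng, n, d)
    y_train = target(X_train) + noise_std * rng.standard_normal(n)
    a = fit_rfrr(random_features(X_train, W, sigma), y_train, lam)

    X_test = sample_sphere(rng, n_test, d)
    pred = random_features(X_test, W, sigma) @ a
    return float(np.mean((target(X_test) - pred) ** 2))

if __name__ == "__main__":
    # Sweep p at fixed n to look for a peak near n = p (finite-d illustration only).
    for p in (50, 100, 200, 400, 800):
        print(p, rfrr_test_error(p=p))
```

Sweeping `p` (or `n`) with a small `lam` is one way to look for the peak near $n = p$ empirically; at such small $d$ the curves should only be expected to match the asymptotic predictions qualitatively.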

Paper Structure

This paper contains 60 sections, 45 theorems, 500 equations, 7 figures.

Key Result

Theorem 1

Assume $(p(d),n(d))_{d \geq 1}$ are two sequences of integers such that $p/d^{\kappa_1} \to \theta_1$ and $n / d^{\kappa_2} \to \theta_2$ for some $\kappa_1,\kappa_2,\theta_1,\theta_2 \in \mathbb{R}_{>0}$, and denote $\ell = \lceil \min (\kappa_1 , \kappa_2) \rceil$. Let $\{ f_{*,d} \in L^2 (\mathbb and where $(\mathcal{B}_{{\sf test}}, {\mathcal{V}}_{{\sf test}}, \alpha_c)$ and $(\mathcal{B}_{\s

Figures (7)

  • Figure 1: Cartoon illustration of the test error of RFRR in the high-dimensional polynomial scaling $p/d^{\kappa_1} \to \theta_1$ and $n/d^{\kappa_2} \to \theta_2$ as $p,n,d \to \infty$, for $\kappa_1,\kappa_2,\theta_1 , \theta_2 \in \mathbb{R}_{>0}$. Top: test error of RFRR versus $\log(n)/\log(d)$ for fixed $p$. Bottom left: approximation error ($n= \infty$) of random feature models versus $\log(p)/\log(d)$. Bottom right: test error of KRR ($p = \infty$) versus $\log (n) / \log(d)$. The approximation error (resp. KRR test error) follows a staircase decay: each time $\log(p)/\log(d)$ (resp. $\log(n)/\log(d)$) crosses an integer value, the model fits a polynomial approximation to the target function of one degree higher. Peaks can appear in the KRR risk curve at $n = d^\ell/\ell!, \ell \in {\mathbb N}$, depending on some effective regularization and effective signal-to-noise ratio at that scale. The RFRR test error first follows the KRR test error for $n \ll p$, then presents a peak at the interpolation threshold $n = p$, before saturating on the approximation error for $n \gg p$.
  • Figure 3: Test error for different regularization parameters $\lambda$ in the polynomial scaling $\kappa_1 = \kappa_2 = \ell = 2$. Here, $f_{*,d}({\boldsymbol x}) = 2 q_2^{(d)}({\boldsymbol x}^{\mathsf T} {\boldsymbol \beta})$ with $\|{\boldsymbol \beta}\|=1$, $\sigma(x) = 0.5 q_2^{(d)}(x) + 0.5 q_3^{(d)}(x)$ and $\text{SNR}:= \| f_* \|_{L^2}^2 / \rho_\varepsilon^2$. Left figure: $n/d^2=10$ and $\text{SNR} = 5$. Middle figure: $p/d^2=10$ and $\text{SNR} = 5$. Right figure: $p/d^2=10$ and $n/d^2=1$.
  • Figure 4: Test error and training error of RFRR in the critical regime $\kappa_1 = \kappa_2$. We choose the target function to be $f_{*,d}({\boldsymbol x}) = 0.5 {\boldsymbol \beta}^{\mathsf T}{\boldsymbol x} + 1.5({\boldsymbol \beta}^{\mathsf T}{\boldsymbol x})^2 + ({\boldsymbol \beta}^{\mathsf T}{\boldsymbol x})^3$ with $\|{\boldsymbol \beta}\|_2=1$, and the activation function $\sigma(x) = 1.5x + 3x^2 + 2x^3$. We set $\lambda = 1.0$ and $\rho_\varepsilon^2 = 0.25$. The solid lines correspond to the analytical predictions for the test and training errors obtained in Theorem 1, the purple dashed line to the analytical predictions for the KRR test error, and the grey dashed lines to the values of the projections $\| {\mathsf P}_{>1} f_* \|_{L^2}^2$ and $\| {\mathsf P}_{>2} f_* \|_{L^2}^2$. The dots are the empirical results with $d = 50$, and the mean and error bars are computed over 100 independent runs. On the left: we set $\kappa_1 = \kappa_2 = 2$ and $\psi_2= 2n/d^2 =1$ and plot the errors versus $\psi_1 = 2p/d^2$. In the middle: we set $\kappa_1 = \kappa_2 = 2$ and $\psi_1 = 2p/d^2=1$, and plot the errors versus $\psi_2 = 2n/d^2$. On the right: we set $\kappa_1 = \kappa_2 = 1.5$ and $\theta_1 = p/d^{1.5} = 1$, and plot the errors versus $\theta_2 = n / d^{1.5}$.
  • Figure 5: Test error and training error of RFRR in the overparametrized regime $\kappa_1 > \kappa_2$ (left) and underparametrized regime $\kappa_1 < \kappa_2$ (right). We choose the target function to be $f_{*,d}({\boldsymbol x}) = {\boldsymbol \beta}^{\mathsf T}{\boldsymbol x} + ({\boldsymbol \beta}^{\mathsf T}{\boldsymbol x})^2$, $\| {\boldsymbol \beta} \|_2 = 1$, and the activation function $\sigma(x) = x + 0.1 x^2$. The solid lines correspond to the analytical predictions for the test and training errors obtained in Theorem 1, and the grey dashed lines to the values of the projections $\| {\mathsf P}_{>0} f_* \|_{L^2}^2$ and $\| {\mathsf P}_{>1} f_* \|_{L^2}^2$. The dots are the empirical results with $d = 50$, and the mean and error bars are computed over 100 independent runs. On the left: we set $\kappa_1 = 2$, $\kappa_2 = 1$, and $\psi_1 = 2$, and plot the errors versus $\psi_2 = n/d$. On the right: we set $\kappa_1 = 1$, $\kappa_2 = 2$, and $\psi_2 = 2$, and plot the errors versus $\psi_1 = p/d$.
  • Figure 6: Illustration of the bias (dashed curves) and variance (dash-dotted curves) decomposition and the incremental learning process of RFRR. We plot the analytical asymptotic predictions from Definition 2 versus $p$, while $n$ is kept fixed with $\kappa_2 = 2$ and three values of $\psi_2= 2n/d^2 \in \{0.1, 1, 10\}$. We set $\lambda = \rho_\varepsilon^2 =1$, $\sigma(x)=x+x^2+x^3+x^4$ and $f_*({\boldsymbol x}) = 0.5 {\boldsymbol \beta}^{\mathsf T} {\boldsymbol x} + 0.5 ({\boldsymbol \beta}^{\mathsf T} {\boldsymbol x})^2 + 0.5 ({\boldsymbol \beta}^{\mathsf T} {\boldsymbol x})^3$. The dotted lines correspond to the squared norm of each of the frequencies of $f_*$. Recall that $R_{{\sf App}}$ denotes the approximation error.
  • ...and 2 more figures
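
The captions above place experiments on the axes $\log(p)/\log(d)$ and $\log(n)/\log(d)$ and distinguish the regimes $\kappa_1 > \kappa_2$, $\kappa_1 = \kappa_2$ and $\kappa_1 < \kappa_2$. The helper below is a hypothetical convenience for locating a finite $(d, n, p)$ triple on those axes; it is not part of the paper. It treats $\log(p)/\log(d)$ and $\log(n)/\log(d)$ as finite-size proxies for $\kappa_1$ and $\kappa_2$ (ignoring the constants $\theta_1, \theta_2$), uses $\ell = \lceil \min(\kappa_1, \kappa_2) \rceil$ from Theorem 1, and lists the sample sizes $n = d^\ell/\ell!$ at which Figure 1 says KRR peaks can appear. The function names and the rounding tolerance are assumptions.

```python
import math

def scaling_summary(d, n, p, digits=9):
    """Finite-size proxies for the scaling exponents of a (d, n, p) triple."""
    # Rounding guards against floating-point noise such as 2.0000000000000004,
    # which would otherwise push the ceiling up by one.
    kappa1 = round(math.log(p) / math.log(d), digits)
    kappa2 = round(math.log(n) / math.log(d), digits)
    ell = math.ceil(min(kappa1, kappa2))   # level from Theorem 1
    if kappa1 > kappa2:
        regime = "overparametrized (kappa1 > kappa2)"
    elif kappa1 < kappa2:
        regime = "underparametrized (kappa1 < kappa2)"
    else:
        regime = "critical (kappa1 = kappa2)"
    return {"kappa1": kappa1, "kappa2": kappa2, "ell": ell, "regime": regime}

def krr_peak_locations(d, max_degree):
    """Sample sizes n = d^l / l! for l = 1..max_degree, where KRR peaks can appear."""
    return [d ** l / math.factorial(l) for l in range(1, max_degree + 1)]

if __name__ == "__main__":
    # d = 50 as in the experiments; n = d^2 / 2 and p = d^3 are illustrative values.
    print(scaling_summary(d=50, n=1250, p=125000))
    print(krr_peak_locations(d=50, max_degree=3))
```

Because the exponents are only defined in the limit $d \to \infty$, these proxies are rough guides at moderate $d$ rather than exact values of $\kappa_1$ and $\kappa_2$.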

Theorems & Definitions (90)

  • Definition 1: Fixed points at level $\ell \in {\mathbb N}$
  • Definition 2: Asymptotic formulas for RFRR
  • Theorem 1: RFRR asymptotics
  • Theorem 2: Gaussian equivalent model
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Proposition 5
  • Definition 3: Fixed points formula
  • ...and 80 more