Prediction Risk and Estimation Risk of the Ridgeless Least Squares Estimator under General Assumptions on Regression Errors

Sungyoon Lee; Sokbae Lee

TL;DR

This work studies the prediction and estimation risk of the ridgeless least squares estimator in overparameterized linear models under general regression-error structures. Using left-spherical symmetry of the design, it derives exact finite-sample expressions for the variance components in which the error covariance $\Omega$ enters only through its trace: $\mathbb{E}_X[\mathrm{Var}_\Sigma(\hat{\beta}\mid X)]=\frac{1}{n}\mathrm{Tr}(\Omega)\,\mathbb{E}_X[\mathrm{Tr}((X^\top X)^{\dagger}\Sigma)]$ and $\mathbb{E}_X[\mathrm{Var}(\hat{\beta}\mid X)]=\frac{1}{np}\mathrm{Tr}(\Omega)\,\mathbb{E}_X[\mathrm{Tr}(\Lambda^{\dagger})]$. The bias components are obtained under random-effects-type assumptions, yielding closed forms for $R_P(\hat{\beta})$ and $R_E(\hat{\beta})$. An asymptotic analysis based on the Stieltjes transform $s^*$ then reveals a double-descent pattern in the estimation risk. Numerical experiments with AR(1) and clustered errors support the results, which suggest that the benefits of overparameterization extend to time series, panel, and grouped data. Overall, the paper provides a realistic finite-sample framework for ridgeless interpolation under correlated errors and connects these finite-sample results to high-dimensional asymptotics through $s^*$.
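The first variance identity can be traced in a few lines. The derivation below is a sketch under assumptions that are not spelled out in this summary: $\hat{\beta}=X^{\dagger}y$ with $y=X\beta+\varepsilon$ and $\mathrm{Cov}(\varepsilon\mid X)=\Omega$, the notation $\mathrm{Var}_\Sigma(\hat{\beta}\mid X)=\mathrm{Tr}(\mathrm{Cov}(\hat{\beta}\mid X)\,\Sigma)$, and left-spherical symmetry read as $X\overset{d}{=}OX$ for every $n\times n$ orthogonal $O$. The Haar-distributed matrix $O$ is introduced here only for illustration.

```latex
\begin{align*}
\mathrm{Var}_\Sigma(\hat{\beta}\mid X)
  &= \mathrm{Tr}\!\left(X^{\dagger}\,\Omega\,X^{\dagger\top}\,\Sigma\right), \\
\mathbb{E}_X\!\left[\mathrm{Var}_\Sigma(\hat{\beta}\mid X)\right]
  &= \mathbb{E}_X\,\mathbb{E}_O\!\left[\mathrm{Tr}\!\left(X^{\dagger}\,O^{\top}\Omega\,O\,X^{\dagger\top}\,\Sigma\right)\right]
  && \text{since } (OX)^{\dagger}=X^{\dagger}O^{\top} \text{ and } OX\overset{d}{=}X, \\
  &= \frac{\mathrm{Tr}(\Omega)}{n}\,\mathbb{E}_X\!\left[\mathrm{Tr}\!\left(X^{\dagger}X^{\dagger\top}\,\Sigma\right)\right]
  && \text{since } \mathbb{E}_O\!\left[O^{\top}\Omega\,O\right]=\tfrac{\mathrm{Tr}(\Omega)}{n}\,I_n, \\
  &= \frac{1}{n}\,\mathrm{Tr}(\Omega)\,\mathbb{E}_X\!\left[\mathrm{Tr}\!\left((X^{\top}X)^{\dagger}\,\Sigma\right)\right]
  && \text{since } X^{\dagger}X^{\dagger\top}=(X^{\top}X)^{\dagger}.
\end{align*}
```

Read this way, the off-diagonal entries of $\Omega$ are averaged out by the rotation, which is why only $\mathrm{Tr}(\Omega)$ survives.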

Abstract

In recent years, there has been significant growth in research focusing on minimum $\ell_2$ norm (ridgeless) interpolation least squares estimators. However, the majority of these analyses have been limited to an unrealistic regression error structure, assuming independent and identically distributed errors with zero mean and common variance. In this paper, we explore prediction risk as well as estimation risk under more general regression error assumptions, highlighting the benefits of overparameterization in a more realistic setting that allows for clustered or serial dependence. Notably, we establish that the estimation difficulties associated with the variance components of both risks can be summarized through the trace of the variance-covariance matrix of the regression errors. Our findings suggest that the benefits of overparameterization can extend to time series, panel, and grouped data.
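The trace-only claim can be checked numerically. The following is a minimal Monte Carlo sketch, not the authors' experiment: the Gaussian design, the diagonal `Sigma`, the three error covariances in `omegas`, and the sample sizes are all hypothetical choices made for illustration. It assumes the prediction-variance component for a fixed design is $\mathrm{Tr}(X^{\dagger}\Omega X^{\dagger\top}\Sigma)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_draws = 20, 40, 400  # overparameterized: p > n (illustrative sizes)

# Hypothetical anisotropic feature covariance (diagonal for simplicity).
Sigma = np.diag(np.linspace(0.5, 1.5, p))
Sigma_half = np.sqrt(Sigma)  # elementwise sqrt is valid because Sigma is diagonal

# Three hypothetical error covariances, all with trace equal to n.
idx = np.arange(n)
omegas = {
    "iid": np.eye(n),
    "serial": 0.6 ** np.abs(idx[:, None] - idx[None, :]),  # unit diagonal
    "heteroskedastic": np.diag(np.linspace(0.2, 1.8, n)),  # diagonal averages 1
}

def conditional_variance(X, Omega, Sigma):
    """Assumed variance component Tr(X^+ Omega X^+T Sigma) for a fixed design X."""
    X_pinv = np.linalg.pinv(X)  # shape (p, n)
    return np.trace(X_pinv @ Omega @ X_pinv.T @ Sigma)

mc = {name: 0.0 for name in omegas}
trace_term = 0.0
for _ in range(n_draws):
    # Rows are N(0, Sigma); such a Gaussian design is left-spherical.
    X = rng.standard_normal((n, p)) @ Sigma_half
    X_pinv = np.linalg.pinv(X)
    for name, Omega in omegas.items():
        mc[name] += conditional_variance(X, Omega, Sigma) / n_draws
    # Tr((X^T X)^+ Sigma), computed through X^+ X^+T = (X^T X)^+.
    trace_term += np.trace(X_pinv @ X_pinv.T @ Sigma) / n_draws

theory = {name: np.trace(Om) / n * trace_term for name, Om in omegas.items()}

for name in omegas:
    print(f"{name:16s} simulated={mc[name]:.4f}  trace formula={theory[name]:.4f}")
```

Because all three covariances share the same trace, the three simulated averages should agree with each other and with the trace formula up to Monte Carlo error.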

Paper Structure (15 sections, 7 theorems, 49 equations, 5 figures)

Introduction
The Framework under General Assumptions on Regression Errors
The Variance Components of Prediction and Estimation Risks
The variance component of prediction risk
The variance component of estimation risk
Numerical experiments
AR(1) Errors
Clustered Errors
The Bias Components of Prediction and Estimation Risks
The bias component of prediction risk
The bias component of estimation risk
Asymptotic analysis of estimation risk
Details for drawing Figure 1
Details for drawing Figures 2, 3, and 4
Proofs omitted in the main text

Key Result

Lemma 3.1

For a subset ${\mathcal{S}}\subset \mathbb{R}^{m\times m}$ satisfying $C^{-1}\in {\mathcal{S}}$ for all $C\in {\mathcal{S}}$, if matrix-valued random variables $Z$ and $AZ$ have the same distribution measure $\mu_Z$ for any $A \in {\mathcal{S}}$, then we have for any function $f\in L^1(\mu_Z)$ and any probability density function $\nu$ on ${\mathcal{S}}$.
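The hypothesis of the lemma already implies an averaging identity. It is sketched here as an assumed reading of the lemma's conclusion, not a quotation of the authors' displayed equation. Writing $\nu(A)\,dA$ for integration against the density $\nu$ is notation introduced for this sketch.

```latex
\begin{align*}
\mathbb{E}_Z[f(Z)] &= \mathbb{E}_Z[f(AZ)]
  && \text{for every } A\in\mathcal{S}, \text{ because } AZ \overset{d}{=} Z, \\
\mathbb{E}_Z[f(Z)] &= \int_{\mathcal{S}} \mathbb{E}_Z[f(AZ)]\,\nu(A)\,dA
  && \text{because } \nu \text{ integrates to one}, \\
  &= \mathbb{E}_Z\!\left[\int_{\mathcal{S}} f(AZ)\,\nu(A)\,dA\right]
  && \text{by Fubini, using } f\in L^1(\mu_Z).
\end{align*}
```

With $\mathcal{S}$ taken to be the orthogonal group and $f$ a conditional variance, an identity of this kind is what allows $\Omega$ to be replaced by its rotational average.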

Figures (5)

Figure 1: Comparison of in-sample and out-of-sample mean squared error (MSE) across various degrees of clustered noise. The vertical line indicates $p=n\;(=1,415)$.
Figure 2: Our theory (dashed lines) matches the expected variances (solid lines) of the prediction (left) and estimation risks (right) in Example 2.1 (AR(1) Errors). Each point $(\sigma^2,\rho^2)$ corresponds to a different noise covariance matrix $\Omega$; points on the same line $\{(\sigma^2,\rho^2): \sigma^2/\kappa^2+\rho^2 = 1\}$ for some $\kappa^2>0$ share the same $\mathop{\mathrm{Tr}}\nolimits(\Omega)$ and therefore the same expected variance. We set $n=50,p=100$, and evaluate on 100 samples of $X$ and 100 samples of $\varepsilon$ (for each realization of $X$) to approximate the expectations.
Figure 3: Our theory (dashed lines) matches the expected variances (solid lines) of the prediction (left) and estimation risks (right) in Example 2.2 (Clustered Errors). Each point $(\sigma_1^2,\sigma_2^2)$ corresponds to a different noise covariance matrix $\Omega$; points on the same line $\{(\sigma_1^2,\sigma_2^2): \frac{n_1}{n}\sigma_1^2+\frac{n_2}{n}\sigma_2^2 = \kappa^2\}$ for some $\kappa^2>0$ share the same $\mathop{\mathrm{Tr}}\nolimits(\Omega)$ and therefore the same expected variance. We set $G=2,(n_1=5,n_2=15),n=20,p=40,\rho_1=\rho_2=0.05$, and evaluate on 100 samples of $X$ and 100 samples of $\varepsilon$ (for each realization of $X$) to approximate the expectations.
Figure 4: The "descent curve" in the overparameterized regime for prediction risk (left) and estimation risk (right). We test $\Omega$'s with $\mathop{\mathrm{Tr}}\nolimits(\Omega)/n=1,2,4$ in black, blue, and red, respectively. For the anisotropic feature, the expected variance ($\times$) and its theoretical expression ($\bullet$) are $\Theta\left(\frac{\mathop{\mathrm{Tr}}\nolimits(\Omega)/n}{\gamma-1}\right)$ and are larger than their counterparts in the high-dimensional asymptotics for the isotropic $\Sigma=I$. For the isotropic $\Sigma=I$, the variance terms (dotted) and the bias terms (dashed) in the high-dimensional asymptotics are $\frac{1}{\gamma-1}\lim_{n\rightarrow\infty}\frac{\mathop{\mathrm{Tr}}\nolimits(\Omega)}{n}$ and $r^2\left(1-\frac{1}{\gamma}\right)$, respectively.
Figure 5: We use the same setting as Figure 3, except that we uniformly sample each $\rho_i$ from $[0, 0.05]$ for each experiment with the pairs $(\sigma_1^2,\sigma_2^2)$. As expected, the off-diagonal elements $\rho_i$ of $\Omega$ do not affect the expected variances.
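The equal-trace lines in the captions of Figures 2 and 3 can be reproduced directly. The sketch below rests on two assumed parametrizations that are consistent with those captions but are not quoted from the paper: a stationary AR(1) covariance $\Omega_{ij}=\frac{\sigma^2}{1-\rho^2}\rho^{|i-j|}$, and a block-diagonal clustered covariance whose block $g$ has $\sigma_g^2$ on the diagonal and $\rho_g$ off the diagonal. The helper names `ar1_cov` and `clustered_cov` are hypothetical.

```python
import numpy as np

def ar1_cov(n, sigma2, rho):
    """Assumed stationary AR(1) covariance: sigma2 / (1 - rho^2) * rho^|i-j|."""
    idx = np.arange(n)
    return sigma2 / (1.0 - rho**2) * rho ** np.abs(idx[:, None] - idx[None, :])

def clustered_cov(sizes, sigma2s, rhos):
    """Assumed block-diagonal covariance: block g has sigma2s[g] on the
    diagonal and rhos[g] in every off-diagonal position."""
    n = sum(sizes)
    Omega = np.zeros((n, n))
    start = 0
    for n_g, s2, r in zip(sizes, sigma2s, rhos):
        block = np.full((n_g, n_g), r)
        np.fill_diagonal(block, s2)
        Omega[start:start + n_g, start:start + n_g] = block
        start += n_g
    return Omega

kappa2 = 1.0

# AR(1): move along the line sigma^2 / kappa^2 + rho^2 = 1 with n = 50.
n_ar = 50
ar_traces = []
for rho2 in (0.0, 0.25, 0.5, 0.75):
    sigma2 = kappa2 * (1.0 - rho2)
    ar_traces.append(np.trace(ar1_cov(n_ar, sigma2, np.sqrt(rho2))))

# Clustered: move along (n1/n) sigma1^2 + (n2/n) sigma2^2 = kappa^2
# with G = 2, n1 = 5, n2 = 15, rho1 = rho2 = 0.05.
sizes = (5, 15)
n_cl = sum(sizes)
cl_traces = []
for sigma1_sq in (0.5, 1.0, 2.0):
    sigma2_sq = (kappa2 - sizes[0] / n_cl * sigma1_sq) / (sizes[1] / n_cl)
    Omega = clustered_cov(sizes, (sigma1_sq, sigma2_sq), (0.05, 0.05))
    cl_traces.append(np.trace(Omega))

print("AR(1) traces:    ", np.round(ar_traces, 6))
print("clustered traces:", np.round(cl_traces, 6))
```

Under these parametrizations every covariance on a given line has trace $n\kappa^2$, so the trace formula assigns them the same expected variance even though their off-diagonal structure differs.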

Theorems & Definitions (19)

Example 2.1: AR(1) Errors
Example 2.2: Clustered Errors
Definition 3.1: Left-Spherical Symmetry [dawid1977spherical; dawid1978extendibility; dawid1981some; gupta1999matrix]
Lemma 3.1
Theorem 3.2
Proof: Sketch of Proof
Theorem 3.3
Corollary 4.1
Corollary 4.2
Definition 4.1
...and 9 more
