
The distribution of Ridgeless least squares interpolators

Qiyang Han, Xiaocong Xu

TL;DR

The paper delivers a comprehensive high-dimensional distributional theory for the Ridgeless interpolator in overparametrized linear models by linking it to a Ridge estimator in an associated Gaussian sequence model with effective noise and regularization solved through fixed-point equations. It establishes uniform distributional characterizations for all $\ell_q$-type risks, unveils the implicit regularization mechanism via the positive implicit regularization parameter $\tau_{\eta,\ast}$, and proves universality across Gaussian and non-Gaussian designs. The work further shows that cross-validation methods (GCV and $k$-fold CV) asymptotically optimize not only prediction risk but also estimation and in-sample risks, enabling debiased inference with short confidence intervals. The results are underpinned by mean-field arguments, CGMT-based proof strategies, and a rigorous treatment of fixed-point equations with both population and sample versions. Collectively, the findings provide a precise, distributional understanding of Ridgeless interpolation, illuminate cross-validation’s broader utility, and offer principled tools for inference in highly overparameterized regimes.

Abstract

The Ridgeless minimum $\ell_2$-norm interpolator in overparametrized linear regression has attracted considerable attention in recent years in both the machine learning and statistics communities. While it seems to defy the conventional wisdom that overfitting leads to poor prediction, recent theoretical research on its $\ell_2$-type risks reveals that its norm-minimizing property induces an 'implicit regularization' that helps prediction in spite of interpolation. This paper takes a further step that aims at understanding its precise stochastic behavior as a statistical estimator. Specifically, we characterize the distribution of the Ridgeless interpolator in high dimensions, in terms of a Ridge estimator in an associated Gaussian sequence model with positive regularization, which provides a precise quantification of the prescribed implicit regularization in the most general distributional sense. Our distributional characterizations hold for general non-Gaussian random designs and extend uniformly to positively regularized Ridge estimators. As a direct application, we obtain a complete characterization for a general class of weighted $\ell_q$ risks of the Ridge(less) estimators that were previously known only for $q=2$ by random matrix methods. These weighted $\ell_q$ risks include not only the standard prediction and estimation errors, but also the non-standard covariate shift settings. Our uniform characterizations further reveal a surprising feature of the commonly used generalized and $k$-fold cross-validation schemes: tuning the estimated $\ell_2$ prediction risk by these methods alone leads to simultaneously optimal $\ell_2$ in-sample, prediction and estimation risks, as well as the optimal length of debiased confidence intervals.
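As a small numerical illustration of the object the abstract studies, the sketch below computes the minimum $\ell_2$-norm interpolator with NumPy and compares it to a Ridge estimator with a tiny positive penalty. It is not taken from the paper: the normalization of the Ridge objective, the function names `ridge` and `ridgeless`, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def ridge(X, y, eta):
    """Ridge estimator for eta > 0, under the ASSUMED normalization
    argmin_mu (1/(2n)) * ||y - X mu||^2 + (eta/2) * ||mu||^2."""
    n, _ = X.shape
    # Dual form, convenient when p > n: X^T (X X^T + n*eta*I_n)^{-1} y
    return X.T @ np.linalg.solve(X @ X.T + n * eta * np.eye(n), y)

def ridgeless(X, y):
    """Minimum l2-norm least squares solution via the pseudoinverse."""
    return np.linalg.pinv(X) @ y

# Hypothetical overparametrized setup (p > n) with an isotropic Gaussian design.
rng = np.random.default_rng(0)
n, p = 20, 60
X = rng.standard_normal((n, p))
mu0 = rng.standard_normal(p) / np.sqrt(p)
y = X @ mu0 + 0.5 * rng.standard_normal(n)

mu_hat = ridgeless(X, y)

# With p > n and X of full row rank, the fit is exact on the training data.
interpolation_gap = np.max(np.abs(X @ mu_hat - y))

# The Ridge solution approaches the Ridgeless one as eta decreases to 0.
ridge_gap = np.linalg.norm(ridge(X, y, 1e-8) - mu_hat)

print(interpolation_gap, ridge_gap)
```

Both printed quantities should be close to zero: the first confirms interpolation, the second that the Ridgeless estimator is the small-penalty limit of Ridge in this setup.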

Paper Structure (65 sections, 55 theorems, 398 equations, 2 figures)


Key Result

Proposition 2.1

Recall $\mathcal{H}_\Sigma=\mathop{\mathrm{tr}}\nolimits(\Sigma^{-1})/n$. The following hold.

Figures (2)

  • Figure 1: Left panel: Comparison between empirical risks and theoretical risks for $\ast=\mathsf{GCV}$ and $\bullet = \mathsf{CV}$ with $k=5$. Middle panel: Averaged CI coverage $\mathscr{C}^{\mathsf{dR}}(\widehat{\eta}^{\#})$ for $\# \in \{\mathsf{GCV},\mathsf{CV}\}$ and the oracle $\mathscr{C}^{\mathsf{dR}}(\eta_\ast)$. Right panel: CI length of $\mathrm{CI}_1(\widehat{\eta}^{\#})$ for $\# \in \{\mathsf{GCV},\mathsf{CV}\}$ and the oracle CI length. See Section \ref{section:application} for the precise definitions.
  • Figure 2: Validation of (\ref{eqn:opt_reg_l2}) (see also Theorem \ref{thm:small_intep} for a rigorous formulation). The theoretical risks $\bar{R}^{\#}_{(\Sigma,\mu_0)}(\eta)$ are computed by solving (\ref{eqn:fpe}), and the empirical risks $R^{\#}_{(\Sigma,\mu_0)}(\eta)$ are computed via Monte Carlo simulation over 200 repetitions. Left panel: noisy case with minimal empirical risks attained at $\eta_\ast=\mathsf{SNR}_{\mu_0}^{-1}=1$ (marked with $\ast$). Middle panel: noiseless case with all risks minimized at the interpolation regime $\eta_\ast =\mathsf{SNR}_{\mu_0}^{-1}=0$. Right panel: differences between the global minimizer of the empirical risk curves and the oracle $\eta_\ast$ are concentrated around $0$ over 500 different $\mu_0$'s.

Theorems & Definitions (109)

  • Remark 1
  • Proposition 2.1
  • Theorem 2.2
  • Theorem 2.3
  • Remark 2
  • Theorem 3.1
  • Proposition 3.2
  • Theorem 3.3
  • Proposition 3.4
  • Theorem 4.1
  • ...and 99 more