Failures and Successes of Cross-Validation for Early-Stopped Gradient Descent

Pratik Patil, Yuchen Wu, Ryan J. Tibshirani

TL;DR

This work analyzes generalized cross-validation (GCV) and leave-one-out CV (LOOCV) for early-stopped gradient descent (GD) in high-dimensional least-squares regression. It proves that, under proportional asymptotics with $p/n \to \zeta_*$, GCV is generically inconsistent for the GD risk along the training path, even in well-specified linear models with isotropic features, while LOOCV remains uniformly consistent along the entire GD trajectory. Importantly, LOOCV errors yield consistent estimators for the full prediction-error distribution and for a wide class of error functionals, enabling pathwise prediction intervals with nominal coverage conditional on the training data. The paper also introduces a modified augmented-system approach that recovers exact LOOCV predictions for GD and discusses computational shortcuts, highlighting a practical advantage of LOOCV over GCV in this setting. Overall, the results formalize when CV can reliably guide early stopping for GD and provide tools for distributional inference along the GD path in high-dimensional regimes.

Abstract

We analyze the statistical properties of generalized cross-validation (GCV) and leave-one-out cross-validation (LOOCV) applied to early-stopped gradient descent (GD) in high-dimensional least squares regression. We prove that GCV is generically inconsistent as an estimator of the prediction risk of early-stopped GD, even for a well-specified linear model with isotropic features. In contrast, we show that LOOCV converges uniformly along the GD trajectory to the prediction risk. Our theory requires only mild assumptions on the data distribution and does not require the underlying regression function to be linear. Furthermore, by leveraging the individual LOOCV errors, we construct consistent estimators for the entire prediction error distribution along the GD trajectory and consistent estimators for a wide class of error functionals. This in particular enables the construction of pathwise prediction intervals based on GD iterates that have asymptotically correct nominal coverage conditional on the training data.
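
To make the three quantities in the abstract concrete, the sketch below runs GD on a small simulated problem and computes the true risk, GCV, and brute-force LOOCV along the path. It is a minimal illustration rather than the paper's code: the function names (`gd_path`, `gcv_path`, `loocv_path`), the zero initialization, the $1/(2n)$ loss scaling, the refit-on-$(n-1)$-points LOOCV, and the toy problem sizes are all assumptions. GCV is taken in its standard linear-smoother form, using the fact that the GD residual after $k$ steps is $(I - \delta XX^\top/n)^k y$.

```python
import numpy as np

def gd_path(X, y, step, n_iters):
    """GD iterates beta_0 = 0, ..., beta_K on the loss (1/2n)||y - X beta||^2."""
    n, p = X.shape
    beta = np.zeros(p)
    path = [beta.copy()]
    for _ in range(n_iters):
        beta = beta + (step / n) * X.T @ (y - X @ beta)
        path.append(beta.copy())
    return np.array(path)  # shape (n_iters + 1, p)

def gcv_path(X, y, step, n_iters):
    """Standard GCV for the linear smoother H_k = I - (I - step * X X^T / n)^k."""
    n = X.shape[0]
    evals = np.linalg.eigvalsh(X @ X.T / n)
    path = gd_path(X, y, step, n_iters)
    train_err = np.mean((y[None, :] - path @ X.T) ** 2, axis=1)
    ks = np.arange(n_iters + 1)
    # tr(H_k) = n - sum_j (1 - step * lambda_j)^k
    trace_H = n - np.sum((1.0 - step * evals[None, :]) ** ks[:, None], axis=1)
    return train_err / (1.0 - trace_H / n) ** 2

def loocv_path(X, y, step, n_iters):
    """Brute-force LOOCV: rerun GD n times, each time dropping one sample.
    Assumption: each refit normalizes the loss by the n - 1 retained samples."""
    n = X.shape[0]
    sq_err = np.zeros((n, n_iters + 1))
    for i in range(n):
        keep = np.arange(n) != i
        path_i = gd_path(X[keep], y[keep], step, n_iters)
        sq_err[i] = (y[i] - path_i @ X[i]) ** 2
    return sq_err.mean(axis=0)

# Toy overparameterized problem (hypothetical sizes, far smaller than the paper's).
rng = np.random.default_rng(0)
n, p, sigma = 60, 120, 1.0
beta_star = rng.standard_normal(p)
beta_star *= 5.0 / np.linalg.norm(beta_star)  # signal with l2 norm 5
X = rng.standard_normal((n, p))               # isotropic Gaussian features
y = X @ beta_star + sigma * rng.standard_normal(n)

step, n_iters = 0.01, 200
path = gd_path(X, y, step, n_iters)
# For a linear model with isotropic features, the conditional prediction risk
# of beta is ||beta - beta_star||^2 + sigma^2.
true_risk = np.sum((path - beta_star) ** 2, axis=1) + sigma ** 2
gcv = gcv_path(X, y, step, n_iters)
loo = loocv_path(X, y, step, n_iters)
```

Brute-force LOOCV costs $n$ separate GD runs, which is affordable only at toy scale. Plotting `gcv` and `loo` against `true_risk` over the iteration index is the natural way to inspect the comparison.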

Paper Structure (83 sections, 54 theorems, 261 equations, 26 figures)

Key Result

Theorem 1

Suppose that $(x_i,y_i)$, $i \in [n]$, are i.i.d. and satisfy both Assumptions \ref{asm:feature_dist} and \ref{asm:response_dist}, where either $r^2 > 0$ or $\sigma^2 > 0$. As $n,p \to \infty$, assume $p / n \to \zeta_{\ast}$, and $k \to \infty$, $\delta \to 0$ such that $k \delta \to T$, where $T, \zeta_{\ast} > 0$ are constants. [...] where we recall that $\widehat{R}^{\mathrm{gcv}}(\widehat{\bm{\beta}}_k)$ and $R(\widehat{\bm{\beta}}_k)$ [...]
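
For orientation, the two quantities named in the theorem can be written out under the usual conventions. This is a sketch, not a reproduction of the paper's displays (which are missing from this excerpt): it assumes GD starts at zero with a constant step size $\delta$ on the loss $\frac{1}{2n}\|\bm{y} - \bm{X}\bm{\beta}\|_2^2$, that GCV takes the standard linear-smoother form, and that $(\bm{x}_0, y_0)$ denotes an independent test point. The symbols $\bm{H}_k$, $\bm{x}_0$, and $y_0$ are notation introduced here.

```latex
\begin{align*}
\widehat{\bm{\beta}}_k
  &= \widehat{\bm{\beta}}_{k-1}
     + \frac{\delta}{n}\, \bm{X}^\top \bigl(\bm{y} - \bm{X}\widehat{\bm{\beta}}_{k-1}\bigr),
  \qquad \widehat{\bm{\beta}}_0 = \bm{0}, \\
\bm{y} - \bm{X}\widehat{\bm{\beta}}_k
  &= \Bigl(\bm{I}_n - \tfrac{\delta}{n}\bm{X}\bm{X}^\top\Bigr)^{k} \bm{y}
  \quad\Longrightarrow\quad
  \bm{H}_k = \bm{I}_n - \Bigl(\bm{I}_n - \tfrac{\delta}{n}\bm{X}\bm{X}^\top\Bigr)^{k}, \\
\widehat{R}^{\mathrm{gcv}}(\widehat{\bm{\beta}}_k)
  &= \frac{\tfrac{1}{n}\,\bigl\|\bm{y} - \bm{X}\widehat{\bm{\beta}}_k\bigr\|_2^2}
          {\bigl(1 - \operatorname{tr}(\bm{H}_k)/n\bigr)^2}, \\
R(\widehat{\bm{\beta}}_k)
  &= \mathbb{E}\bigl[(y_0 - \bm{x}_0^\top \widehat{\bm{\beta}}_k)^2 \,\big|\, \bm{X}, \bm{y}\bigr].
\end{align*}
```

The second line follows by multiplying the GD update by $\bm{X}$ and subtracting from $\bm{y}$, which shows that each step multiplies the residual by $\bm{I}_n - \tfrac{\delta}{n}\bm{X}\bm{X}^\top$.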

Figures (26)

  • Figure 1: GCV can perform poorly in overparameterized problems, yet LOOCV gives accurate risk estimates. We investigate the risk of early-stopped gradient descent, applied to the least squares loss, as a function of iteration number. The left panel shows an underparameterized experiment with $n = 3000$, $p = 1500$, and the right panel an overparameterized experiment with $n = 3000$, $p = 6000$. In both cases, the data is generated from a linear model with i.i.d. standard normal features, a true signal vector with $\ell_2$ norm of $5$, and noise standard deviation of $1$. GD uses a constant step size of $0.01$. In the overparameterized case, we can see that the GCV risk estimate deviates wildly from the true risk, whereas LOOCV remains accurate throughout the entire path.
  • Figure 2: LOOCV provides (asymptotically) valid prediction intervals, for various nominal coverage levels. We investigate the empirical coverage and length of LOOCV prediction intervals along the GD path, at varying coverage levels. We consider an overparameterized regime with $n=2500$ and $p=5000$. The features are drawn from a Gaussian distribution with a covariance structure: $\bm{\Sigma}_{ij} = \rho^{|i - j |}$ for all $i,j$ and $\rho=0.25$. The response is generated from a nonlinear model with heavy-tailed noise: $t$-distribution with 5 degrees of freedom. The linear component of $\mathbb{E}[y_i \,|\, \bm{x}_i = x]$ is aligned with the top eigenvector of $\bm{\Sigma}$. GD is run with a constant step size of $0.01$. (See \ref{sec:additional-numerical-illustrations} for further details on the experimental setup.) We can see that the prediction intervals generally have excellent finite-sample coverage along the entire path (left), and the smallest prediction length is typically obtained at a large iteration of GD (right).
  • Figure 4: Illustrations of the differences between the LOO systems for ridge regression (left) and GD (right).
  • Figure 5: Illustration of the modified augmented system for LOO in GD.
  • Figure S.1: Illustration of the Marchenko-Pastur density in the underparameterized (left) and overparameterized regimes (right). Note that in the overparameterized regime, there is a point mass at $s = 0$ (shown with a red dot) as in \ref{eq:MP-law-gt1}. This point mass will need special care in the subsequent asymptotic limits.
  • ...and 21 more figures
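
The prediction intervals described in the Figure 2 caption can be sketched as follows: take empirical quantiles of the signed LOOCV residuals at a fixed GD iteration and shift them by the full-data prediction. This is a hedged, scaled-down illustration, not the paper's experiment. The function names, the brute-force refits, the quantile-based interval form, the linear response (the caption uses a nonlinear one), and the sizes $n = 100$, $p = 200$ are all assumptions; only the $\rho = 0.25$ covariance, the $t_5$ noise, and the step size $0.01$ come from the caption.

```python
import numpy as np

def gd_iterate(X, y, step, n_iters):
    """Final GD iterate from zero initialization on (1/2n)||y - X beta||^2."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iters):
        beta = beta + (step / n) * X.T @ (y - X @ beta)
    return beta

def loo_residuals(X, y, step, n_iters):
    """Signed leave-one-out residuals y_i - x_i^T beta_{k,-i}, by brute-force refits."""
    n = X.shape[0]
    res = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i = gd_iterate(X[keep], y[keep], step, n_iters)
        res[i] = y[i] - X[i] @ beta_i
    return res

def loo_prediction_interval(X, y, X_new, step, n_iters, level=0.9):
    """Interval = full-data GD prediction shifted by LOO residual quantiles."""
    alpha = 1.0 - level
    res = loo_residuals(X, y, step, n_iters)
    q_lo, q_hi = np.quantile(res, [alpha / 2, 1.0 - alpha / 2])
    center = X_new @ gd_iterate(X, y, step, n_iters)
    return center + q_lo, center + q_hi

# Scaled-down setup: AR(1)-type covariance with rho = 0.25 and t_5 noise.
rng = np.random.default_rng(1)
n, p, n_test, rho = 100, 200, 2000, 0.25
idx = np.arange(p)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
L = np.linalg.cholesky(Sigma)
beta_star = rng.standard_normal(p) / np.sqrt(p)  # hypothetical signal

def sample(m):
    X = rng.standard_normal((m, p)) @ L.T        # rows have covariance Sigma
    y = X @ beta_star + rng.standard_t(5, size=m)
    return X, y

X, y = sample(n)
X_test, y_test = sample(n_test)

lo, hi = loo_prediction_interval(X, y, X_test, step=0.01, n_iters=100, level=0.9)
coverage = np.mean((y_test >= lo) & (y_test <= hi))
length = hi[0] - lo[0]  # same for every test point in this construction
```

Repeating the last three lines over a grid of `n_iters` values gives coverage and length as functions of the GD iteration, which is the kind of pathwise summary the caption describes.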

Theorems & Definitions (93)

  • Theorem 1: Inconsistency of GCV
  • Definition 1: $T_2$-inequality
  • Theorem 2: Squared risk consistency of LOOCV
  • Theorem 3: Functional consistency of LOOCV
  • Theorem 4: Coverage guarantee
  • Proposition 4: Correctness of the modified augmented system
  • Proposition 4: Smoother representation for the modified augmented system
  • Proposition 4: Recursive shortcut formula for LOO predictions in GD
  • Lemma 4: Prediction risks are asymptotically equivalent
  • Lemma 4: GCV risk estimates are asymptotically equivalent
  • ...and 83 more