Failures and Successes of Cross-Validation for Early-Stopped Gradient Descent
Pratik Patil, Yuchen Wu, Ryan J. Tibshirani
TL;DR
This work analyzes generalized cross-validation (GCV) and leave-one-out CV (LOOCV) for early-stopped gradient descent (GD) in high-dimensional least-squares regression. It proves that, under proportional asymptotics with p/n \to \zeta_*, GCV is generically inconsistent for the GD risk along the training path, even in well-specified linear models with isotropic features, while LOOCV remains uniformly consistent along the entire GD trajectory. Importantly, LOOCV errors yield consistent estimators for the full prediction-error distribution and for a wide class of error functionals, enabling pathwise prediction intervals with nominal coverage conditional on the training data. The paper also introduces a modified augmented-system approach that recovers exact LOOCV predictions for GD and discusses computational shortcuts, highlighting a practical advantage of LOOCV over GCV in this setting. Overall, the results formalize when CV can reliably guide early stopping for GD and provide tools for distributional inference along the GD path in high-dimensional regimes.
Abstract
We analyze the statistical properties of generalized cross-validation (GCV) and leave-one-out cross-validation (LOOCV) applied to early-stopped gradient descent (GD) in high-dimensional least squares regression. We prove that GCV is generically inconsistent as an estimator of the prediction risk of early-stopped GD, even for a well-specified linear model with isotropic features. In contrast, we show that LOOCV converges uniformly along the GD trajectory to the prediction risk. Our theory requires only mild assumptions on the data distribution and does not require the underlying regression function to be linear. Furthermore, by leveraging the individual LOOCV errors, we construct consistent estimators for the entire prediction error distribution along the GD trajectory and consistent estimators for a wide class of error functionals. This in particular enables the construction of pathwise prediction intervals based on GD iterates that have asymptotically correct nominal coverage conditional on the training data.
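To make the objects in the abstract concrete, the following is a minimal brute-force sketch of LOOCV along a GD path, not the paper's implementation. It assumes zero initialization, a constant step size, and a squared loss normalized by the number of training points; the names `gd_path` and `loo_residuals`, the simulated data, and the 90% interval level are illustrative choices. Each leave-one-out fit reruns GD from scratch, so the cost is n times that of a single path, which is what a computational shortcut would avoid.

```python
import numpy as np

def gd_path(X, y, step, n_iters):
    """GD iterates for least squares, started at zero (assumed setup)."""
    n, p = X.shape
    beta = np.zeros(p)
    path = np.empty((n_iters + 1, p))
    path[0] = beta
    for k in range(1, n_iters + 1):
        # Gradient step on the loss ||y - X beta||^2 / (2n)
        beta = beta + (step / n) * X.T @ (y - X @ beta)
        path[k] = beta
    return path

def loo_residuals(X, y, step, n_iters):
    """Brute-force leave-one-out residuals at every GD iteration.

    Returns an array of shape (n_iters + 1, n); entry [k, i] is the error
    on point i of the k-th iterate trained without point i.
    """
    n = X.shape[0]
    res = np.empty((n_iters + 1, n))
    for i in range(n):
        keep = np.arange(n) != i
        path_i = gd_path(X[keep], y[keep], step, n_iters)
        res[:, i] = y[i] - path_i @ X[i]
    return res

# Simulated data (illustrative only): isotropic features, linear signal
rng = np.random.default_rng(0)
n, p = 60, 30
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta_true + 0.5 * rng.standard_normal(n)

res = loo_residuals(X, y, step=0.1, n_iters=50)

# LOOCV estimate of squared prediction risk at each iteration
loo_risk = (res ** 2).mean(axis=1)
k_best = int(loo_risk.argmin())  # early-stopping iteration chosen by LOOCV

# Empirical quantiles of the LOO errors at k_best give a prediction interval
lo, hi = np.quantile(res[k_best], [0.05, 0.95])
beta_k = gd_path(X, y, step=0.1, n_iters=50)[k_best]
x_new = rng.standard_normal(p)
interval = (x_new @ beta_k + lo, x_new @ beta_k + hi)
```

Here `loo_risk` plays the role of the pathwise risk estimate, and the row `res[k]` is the empirical error distribution at iteration `k` from which other error functionals, such as quantiles, can be read off.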
