Implicit vs. explicit regularization for high-dimensional gradient descent
Thomas Stark, Lukas Steinberger
TL;DR
This work analyzes the generalization risk of constant-step regularized gradient descent (RGD) on an OLS objective with explicit $\ell_2$ regularization (parameter $λ$) in high-dimensional linear settings. It shows that, for $λ$ at least the optimal $λ^* = \frac{σ^2}{τ^2}\frac{p}{n}$, the out-of-sample risk decreases monotonically in the iteration number $m$ and converges to the ridge benchmark as $m\to\infty$, indicating that explicit regularization can be statistically efficient while implicit regularization via early stopping may not be. The authors provide data-driven, cross-validation-free estimators for $(σ^2,τ^2)$ to form $\hat{λ}_n = γ_n \hat{σ}_n^2/\hat{τ}_n^2$ and prove that RGD tuned with $\hat{λ}_n$ matches the performance of optimally tuned ridge in large-$n$, large-$p$ regimes, extending Dicker’s ideas to non-Gaussian designs. The results deliver both finite-sample guarantees and a practical, scalable tuning procedure for high-dimensional linear prediction, and they also cover deterministic signal settings.
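To make the plug-in rule $\hat{λ}_n = γ_n \hat{σ}_n^2/\hat{τ}_n^2$ concrete, the following is a minimal sketch under assumptions that are not taken from the paper: the design has i.i.d. standard Gaussian entries, $τ^2 = \|β\|^2$, and $γ_n = p/n$ (inferred from the form of $λ^*$). The moment formulas used here are unbiased in that Gaussian setting and follow the spirit of Dicker (2014); the paper's own estimators for non-Gaussian designs may differ. The function names `moment_estimates` and `plug_in_lambda` are hypothetical.

```python
import numpy as np

def moment_estimates(X, y):
    """Moment-based estimates of (sigma^2, tau^2).

    Assumption: X has i.i.d. N(0, 1) entries, so that
    E||y||^2 = n (tau^2 + sigma^2) and
    E||X^T y||^2 = n (n + p + 1) tau^2 + n p sigma^2.
    """
    n, p = X.shape
    y_sq = float(y @ y)                       # ||y||^2
    xty_sq = float(np.sum((X.T @ y) ** 2))    # ||X^T y||^2
    denom = n * (n + 1)
    sigma2_hat = ((p + n + 1) * y_sq - xty_sq) / denom
    tau2_hat = (xty_sq - p * y_sq) / denom
    return sigma2_hat, tau2_hat

def plug_in_lambda(X, y):
    """Plug-in regularization parameter, assuming gamma_n = p / n."""
    n, p = X.shape
    sigma2_hat, tau2_hat = moment_estimates(X, y)
    return (p / n) * sigma2_hat / tau2_hat

# Simulated data (illustrative values only).
rng = np.random.default_rng(0)
n, p, sigma2, tau2 = 2000, 1000, 1.0, 1.0
beta = rng.standard_normal(p)
beta *= np.sqrt(tau2) / np.linalg.norm(beta)  # enforce ||beta||^2 = tau2
X = rng.standard_normal((n, p))
y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)

sigma2_hat, tau2_hat = moment_estimates(X, y)
lam_hat = plug_in_lambda(X, y)
lam_star = (sigma2 / tau2) * (p / n)          # oracle value for comparison
print(sigma2_hat, tau2_hat, lam_hat, lam_star)
```

Both estimates need only $\|y\|^2$ and $\|X^\top y\|^2$, that is, a single pass over the data. This is the sense in which such a rule is cheaper than cross-validation, which refits the model once per fold and per candidate $λ$.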
Abstract
In this paper we investigate the generalization error of gradient descent (GD) applied to an $\ell_2$-regularized OLS objective function in the linear model. Based on our analysis, we develop new methodology for computationally tractable and statistically efficient linear prediction in a high-dimensional and massive data scenario (large-$n$, large-$p$). Our results are based on the surprising observation that the generalization error of optimally tuned regularized gradient descent approaches that of an optimal benchmark procedure monotonically in the iteration number $m$. On the other hand, standard GD for OLS (without explicit regularization) can achieve the benchmark only in degenerate cases. This shows that (optimal) explicit regularization can be nearly statistically efficient (for large $m$), whereas implicit regularization by (optimal) early stopping cannot. To complete our methodology, we provide a fully data-driven and computationally tractable choice of the $\ell_2$ regularization parameter $λ$ that is computationally cheaper than cross-validation. In doing so, we follow and extend ideas of Dicker (2014) to the non-Gaussian case, which requires new results on high-dimensional sample covariance matrices that might be of independent interest.
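The convergence of regularized GD to the ridge solution can be illustrated with a small simulation. This is a sketch under assumptions not stated in the abstract: the objective is scaled as $\frac{1}{2n}\|y - Xb\|^2 + \frac{λ}{2}\|b\|^2$, the iteration starts at zero, the step size is the reciprocal of the largest Hessian eigenvalue, and the design is i.i.d. standard Gaussian so that the excess prediction risk of $b$ equals $\|b - β\|^2$. The helper name `rgd_path` is hypothetical.

```python
import numpy as np

def rgd_path(X, y, lam, n_iter):
    """Constant-step GD on (1/(2n))||y - X b||^2 + (lam/2)||b||^2, started at 0.

    Returns an array of shape (n_iter + 1, p) holding all iterates.
    """
    n, p = X.shape
    gram = X.T @ X / n
    xty = X.T @ y / n
    # eigvalsh returns eigenvalues in ascending order; the last one is the largest.
    step = 1.0 / (np.linalg.eigvalsh(gram)[-1] + lam)
    b = np.zeros(p)
    path = [b.copy()]
    for _ in range(n_iter):
        grad = gram @ b - xty + lam * b
        b = b - step * grad
        path.append(b.copy())
    return np.array(path)

# Simulated data (illustrative values only).
rng = np.random.default_rng(0)
n, p, sigma2, tau2 = 200, 100, 1.0, 1.0
beta = rng.standard_normal(p)
beta *= np.sqrt(tau2) / np.linalg.norm(beta)  # enforce ||beta||^2 = tau2
X = rng.standard_normal((n, p))
y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)

lam_star = (sigma2 / tau2) * (p / n)
path = rgd_path(X, y, lam_star, n_iter=500)

# Closed-form ridge solution for the same objective.
beta_ridge = np.linalg.solve(X.T @ X / n + lam_star * np.eye(p), X.T @ y / n)

gap_to_ridge = np.linalg.norm(path - beta_ridge, axis=1)  # distance to ridge per iteration
excess_risk = np.sum((path - beta) ** 2, axis=1)          # ||b_m - beta||^2 per iteration
print(gap_to_ridge[-1], excess_risk[0], excess_risk[-1])
```

With this step size the distance to the ridge solution contracts at every iteration, because the objective is a strongly convex quadratic. The array `excess_risk` can be plotted against $m$ to inspect the monotone behaviour described above; a single simulated draw illustrates the claim but does not verify it.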
