Implicit vs. explicit regularization for high-dimensional gradient descent

Thomas Stark, Lukas Steinberger

TL;DR

This work analyzes the generalization risk of constant-step regularized gradient descent (RGD) on an OLS objective with an explicit $\ell_2$ penalty of strength $\lambda$ in high-dimensional linear settings. It shows that, for $\lambda$ at least the optimal $\lambda^* = \frac{\sigma^2}{\tau^2}\frac{p}{n}$, the out-of-sample risk decreases monotonically in the iteration number $m$ and converges to the ridge benchmark as $m\to\infty$, indicating that explicit regularization can be statistically efficient while implicit regularization via early stopping may not. The authors provide data-driven, cross-validation-free estimators of $(\sigma^2,\tau^2)$ to form $\hat{\lambda}_n = \gamma_n \hat{\sigma}_n^2/\hat{\tau}_n^2$ and prove that RGD tuned with $\hat{\lambda}_n$ matches the performance of optimally tuned ridge in large-$n$, large-$p$ regimes, extending Dicker's ideas to non-Gaussian designs. The results deliver both finite-sample guarantees and practical, scalable tuning procedures for high-dimensional linear prediction without cross-validation, and they cover non-Gaussian designs as well as deterministic signal settings.

Abstract

In this paper we investigate the generalization error of gradient descent (GD) applied to an $\ell_2$-regularized OLS objective function in the linear model. Based on our analysis we develop new methodology for computationally tractable and statistically efficient linear prediction in a high-dimensional and massive data scenario (large-$n$, large-$p$). Our results are based on the surprising observation that the generalization error of optimally tuned regularized gradient descent approaches that of an optimal benchmark procedure monotonically in the iteration number $m$. On the other hand, standard GD for OLS (without explicit regularization) can achieve the benchmark only in degenerate cases. This shows that (optimal) explicit regularization can be nearly statistically efficient (for large $m$) whereas implicit regularization by (optimal) early stopping cannot. To complete our methodology, we provide a fully data-driven and computationally tractable choice of the $\ell_2$ regularization parameter $\lambda$ that is computationally cheaper than cross-validation. In doing so, we follow and extend ideas of Dicker (2014) to the non-Gaussian case, which requires new results on high-dimensional sample covariance matrices that might be of independent interest.

Paper Structure

This paper contains 14 sections, 17 theorems, 62 equations, 1 figure.

Key Result

Proposition 2.1

If we initialize $\hat{\beta}_0(\lambda,t)=\theta \in \mathbb{R}^p$ and run gradient descent on the $\ell_2$-regularized OLS objective with a constant step size $t>0$ and $\lambda\ge0$, the iterates for $m\geq1$ can be expressed in closed form in terms of powers of the matrix $A=A(\lambda,t)\coloneqq I_p-t(\lambda I_p + X^\top X/n)$.
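
The recursion behind Proposition 2.1 can be checked numerically. The sketch below assumes the objective $f(\beta)=\frac{1}{2n}\|Y-X\beta\|^2+\frac{\lambda}{2}\|\beta\|^2$; this scaling is an assumption, chosen because its gradient step $\beta \mapsto A\beta + tX^\top Y/n$ reproduces the matrix $A$ defined above. The unrolled form $A^m\theta + t\sum_{j=0}^{m-1}A^jX^\top Y/n$ is derived from that assumption and is not quoted from the paper. The function names are hypothetical.

```python
import numpy as np


def rgd_iterates(X, Y, lam, t, m, theta=None):
    """Run m constant-step gradient steps on the assumed objective
    f(beta) = ||Y - X beta||^2 / (2n) + lam * ||beta||^2 / 2."""
    n, p = X.shape
    beta = np.zeros(p) if theta is None else np.array(theta, dtype=float)
    S = X.T @ X / n          # sample covariance (uncentered)
    b = X.T @ Y / n
    for _ in range(m):
        grad = (S + lam * np.eye(p)) @ beta - b
        beta = beta - t * grad
    return beta


def rgd_unrolled(X, Y, lam, t, m, theta=None):
    """Same iterate written as A^m theta + t * sum_{j<m} A^j X^T Y / n,
    with A = I_p - t (lam I_p + X^T X / n)."""
    n, p = X.shape
    theta = np.zeros(p) if theta is None else np.array(theta, dtype=float)
    A = np.eye(p) - t * (lam * np.eye(p) + X.T @ X / n)
    b = X.T @ Y / n
    acc = np.zeros(p)        # accumulates sum_{j<m} A^j b
    Aj_b = b.copy()          # holds A^j b for the current j
    for _ in range(m):
        acc += Aj_b
        Aj_b = A @ Aj_b
    return np.linalg.matrix_power(A, m) @ theta + t * acc
```

With a step size such as $t = 1/(\lambda + s_{\max})$, where $s_{\max}$ is the largest eigenvalue of $X^\top X/n$, all eigenvalues of $A$ lie in $[0,1)$ when $\lambda>0$, so both functions agree for every $m$ and approach the ridge solution $(X^\top X/n+\lambda I_p)^{-1}X^\top Y/n$ as $m$ grows.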

Figures (1)

  • Figure 1: Generalization errors of different estimators plotted against the number of iterations $m$, from 1000 Monte Carlo runs. The simulation uses $\tau^2=2\sigma^2=4$ (so $\sigma^2=2$), $p=1000$, $n=500$, $\lambda^*=\frac{\sigma^2}{\tau^2}\frac{p}{n} = 1$, and a design matrix $X$ with i.i.d. standard normal entries.
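
The value $\lambda^*=1$ in the figure setup, and the monotone behaviour the figure illustrates, can be reproduced in a small sketch. It is not the authors' simulation code. It assumes a random signal with mean zero and covariance $(\tau^2/p)I_p$, noise variance $\sigma^2$, initialization $\theta=0$, test covariates with identity covariance, and the objective scaling $\frac{1}{2n}\|Y-X\beta\|^2+\frac{\lambda}{2}\|\beta\|^2$. Under these assumptions the excess risk given $X$ (the risk minus the irreducible $\sigma^2$) has a closed form in the eigenvalues of $X^\top X/n$, so no Monte Carlo averaging is needed. All function names are hypothetical.

```python
import numpy as np


def optimal_lambda(sigma2, tau2, p, n):
    """lambda* = (sigma^2 / tau^2) * (p / n), as in the figure caption."""
    return (sigma2 / tau2) * (p / n)


def _excess_risk(s, c, sigma2, tau2, n, p):
    # Estimator acts as beta_hat = c(S) X^T Y / n in the eigenbasis of S.
    bias = (tau2 / p) * np.sum((c * s - 1.0) ** 2)
    var = (sigma2 / n) * np.sum(c**2 * s)
    return bias + var


def _eigs(X):
    n, _ = X.shape
    # Clip tiny negative eigenvalues caused by floating-point error.
    return np.clip(np.linalg.eigvalsh(X.T @ X / n), 0.0, None)


def ridge_excess_risk(X, sigma2, tau2, lam):
    """Excess risk of ridge given X, under the assumptions stated above."""
    n, p = X.shape
    s = _eigs(X)
    return _excess_risk(s, 1.0 / (lam + s), sigma2, tau2, n, p)


def rgd_excess_risk(X, sigma2, tau2, lam, t, m_max):
    """Excess risk of RGD iterates m = 1..m_max started at theta = 0."""
    n, p = X.shape
    s = _eigs(X)
    a = 1.0 - t * (lam + s)              # eigenvalues of A
    risks = []
    for m in range(1, m_max + 1):
        c = (1.0 - a**m) / (lam + s)     # spectral filter of the m-th iterate
        risks.append(_excess_risk(s, c, sigma2, tau2, n, p))
    return np.array(risks)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, p = 100, 200                      # scaled down, same ratio p/n = 2
    X = rng.standard_normal((n, p))
    lam = optimal_lambda(2.0, 4.0, p, n)
    t = 1.0 / (lam + np.linalg.eigvalsh(X.T @ X / n).max())
    risks = rgd_excess_risk(X, 2.0, 4.0, lam, t, 200)
    print(risks[0], risks[-1], ridge_excess_risk(X, 2.0, 4.0, lam))
```

The demo uses $n=100$, $p=200$ instead of the figure's $n=500$, $p=1000$ to keep the eigendecomposition cheap; the ratio $p/n=2$ and hence $\lambda^*=1$ are unchanged.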

Theorems & Definitions (24)

  • Proposition 2.1
  • Remark 2.2: On convergence of the iterates
  • Lemma 2.3
  • Lemma 2.4
  • Theorem 2.5
  • Remark 2.6: On the choice of step size
  • Lemma 3.1
  • Lemma 3.2
  • Theorem 3.3: Pan (2010)
  • Proposition 3.4
  • ...and 14 more