Estimating Generalization Performance Along the Trajectory of Proximal SGD in Robust Regression

Kai Tan; Pierre C. Bellec

TL;DR

This paper introduces estimators that precisely track the generalization error of the iterates along the trajectory of the iterative algorithm, allowing us to determine the optimal stopping iteration that minimizes the generalization error.

Abstract

This paper studies the generalization performance of iterates obtained by Gradient Descent (GD), Stochastic Gradient Descent (SGD) and their proximal variants in high-dimensional robust regression problems. The number of features is comparable to the sample size and errors may be heavy-tailed. We introduce estimators that precisely track the generalization error of the iterates along the trajectory of the iterative algorithm. These estimators are provably consistent under suitable conditions. The results are illustrated through several examples, including Huber regression, pseudo-Huber regression, and their penalized variants with non-smooth regularizer. We provide explicit generalization error estimates for iterates generated from GD and SGD, or from proximal SGD in the presence of a non-smooth regularizer. The proposed risk estimates serve as effective proxies for the actual generalization error, allowing us to determine the optimal stopping iteration that minimizes the generalization error. Extensive simulations confirm the effectiveness of the proposed generalization error estimates.
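The setting described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the paper's estimator: it runs proximal SGD on an L1-penalized Huber loss and records the oracle generalization error $\|b^t - b^*\|^2$ at each iterate, which is only computable here because the simulation knows $b^*$ and uses identity feature covariance. All function names, the step size, the penalty level, the batch size, and the dimensions are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np


def huber_psi(r, delta=1.0):
    """Derivative of the Huber loss: identity on [-delta, delta], clipped outside."""
    return np.clip(r, -delta, delta)


def soft_threshold(v, thresh):
    """Proximal operator of thresh * ||.||_1 (coordinate-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)


def proximal_sgd_path(X, y, lam, eta, batch_size, T, rng, delta=1.0):
    """Return the iterates b^0, ..., b^T of proximal SGD started at zero.

    Hypothetical helper: each step takes a mini-batch gradient step on the
    Huber loss, then applies the L1 proximal operator.
    """
    n, p = X.shape
    b = np.zeros(p)
    path = [b.copy()]
    for _ in range(T):
        idx = rng.choice(n, size=batch_size, replace=False)
        residual = y[idx] - X[idx] @ b
        descent = X[idx].T @ huber_psi(residual, delta) / batch_size
        b = soft_threshold(b + eta * descent, eta * lam)
        path.append(b.copy())
    return path


def simulate(seed=0, n=1000, p=200, T=30):
    """Simulate heavy-tailed regression and return the oracle risk per iterate."""
    rng = np.random.default_rng(seed)
    b_star = np.zeros(p)
    b_star[:10] = 1.0                      # assumed sparse signal
    X = rng.standard_normal((n, p))        # identity covariance (assumption)
    noise = rng.standard_t(df=2, size=n)   # heavy-tailed errors
    y = X @ b_star + noise
    path = proximal_sgd_path(X, y, lam=0.05, eta=0.5,
                             batch_size=n // 5, T=T, rng=rng)
    # With identity covariance the generalization error reduces to ||b - b*||^2.
    return np.array([np.sum((b - b_star) ** 2) for b in path])


risks = simulate()
best_t = int(np.argmin(risks))  # oracle stopping iteration for this run
print("oracle best iteration:", best_t, "risk:", risks[best_t])
```

In practice $b^*$ is unknown, so this oracle curve cannot be computed from data; the estimators proposed in the paper are intended as data-driven proxies for it.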

Paper Structure (40 sections, 17 theorems, 156 equations, 4 figures, 1 table)

Introduction
Related literature
Problem setup
Robust regression without penalty
Robust regression with Lasso penalty
Main results
Intuition regarding the estimates of the generalization error
Formal matrix notation to capture recursive derivatives
Main results: estimating the generalization error consistently
Simulation
Additional experiments: varying step sizes for different iterations
Additional experiments: the estimate $\tilde{r}_t^{\rm sub}$ is suboptimal
Discussion
Additional simulation results
Auxiliary Results
...and 25 more sections

Key Result

Theorem 3.6

Let Assumptions assu:X, assu:regime, and assu:rho-1 be fulfilled. Then $\forall \epsilon > 0$, [...]. If additionally Assumption assu:noise holds, then ${\mathbb{E}}[\min\{1, \frac{\| {\boldsymbol{\varepsilon} }\|}{n} \}]\to 0$, so that, as $n,p\to+\infty$ while $(T,\gamma,\eta_{\max},c_0,\delta,\kappa,\epsilon)$ are held fixed, the right-hand side converges to 0 and $\hat{r}_t - r_t$ converges to 0 in probability.

Figures (4)

Figure 1: Risk curves for Huber and Pseudo-Huber regression with GD and SGD algorithms for the scenario $(n,p) = (10000,5000)$. Upper row: Huber regression, Lower row: Pseudo-Huber regression. Left column: GD, Right column: SGD.
Figure 2: Risk curves for L1-penalized Huber and Pseudo-Huber regression with Proximal GD and Proximal SGD algorithms for the scenario $(n,p) = (10000,12000)$. Upper row: L1-penalized Huber regression, Lower row: L1-penalized Pseudo-Huber regression. Left column: Proximal GD, Right column: Proximal SGD.
Figure 3: Risk curves for SGD applied to Huber regression with $(n,p) = (3000,1000)$ using different choices of step sizes. Left panel: $\eta_t = 1$ if $t$ is odd, and $\eta_t = 0$ if $t$ is even. Right panel: $\eta_t = 1$ if $t$ is odd, and $\eta_t = 0.5$ if $t$ is even.
Figure 4: Risk curves for SGD applied to Huber and pseudo-Huber regression with $(n,p,T)=(4000, 1000, 20)$, $|I_t|=n/10$ and $\eta_t=0.2$ for all $t$.

Theorems & Definitions (32)

Example 2.1: GD
Example 2.2: SGD
Example 2.3: Proximal GD
Example 2.4: Proximal SGD
Theorem 3.6: Proved in proof-thm:using-Sigma
Theorem 3.7: Proved in proof-thm:unknwon-Sigma
Remark 3.8
Remark 3.9
Lemma B.1: Proved in proof-lem:dot-b
Lemma B.2: Proved in proof-lem:dF-dx
...and 22 more
