On Regularization via Early Stopping for Least Squares Regression

Rishi Sonthalia; Jackie Lok; Elizaveta Rebrova

TL;DR

This paper analyzes the dynamics of discrete full batch gradient descent for linear regression and shows that when training with a learning rate schedule and a finite time horizon, the early stopped solution is equivalent to the minimum norm solution for a generalized ridge regularized problem.

Abstract

A fundamental problem in machine learning is understanding the effect of early stopping on the parameters obtained and the generalization capabilities of the model. Even for linear models, the effect is not fully understood for arbitrary learning rates and data. In this paper, we analyze the dynamics of discrete full batch gradient descent for linear regression. With minimal assumptions, we characterize the trajectory of the parameters and the expected excess risk. Using this characterization, we show that when training with a learning rate schedule $η_k$, and a finite time horizon $T$, the early stopped solution $β_T$ is equivalent to the minimum norm solution for a generalized ridge regularized problem. We also prove that early stopping is beneficial for generic data with arbitrary spectrum and for a wide variety of learning rate schedules. We provide an estimate for the optimal stopping time and empirically demonstrate the accuracy of our estimate.
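The correspondence between early stopping and a generalized ridge penalty can be checked numerically on a small problem. The sketch below is an illustration under stated assumptions, not the paper's construction: it assumes the loss $\|y - X\beta\|^2/(2n)$, zero initialization, $n > p$ with full column rank, and synthetic Gaussian data. The sizes, the noise level, and the names `residual_factor` and `lam` are hypothetical. Under these assumptions, $T$ steps of gradient descent shrink the least squares solution along each eigenvector of $\hat{\Sigma} = X^T X / n$ by the factor $1 - \prod_k (1 - \eta_k s_i)$, and a ridge penalty $\frac{1}{2}\beta^T \Lambda \beta$ with matching eigenvectors reproduces that shrinkage.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, T = 50, 20, 200                      # hypothetical sizes and time horizon
X = rng.standard_normal((n, p))
beta_star = rng.standard_normal(p)
y = X @ beta_star + 0.5 * rng.standard_normal(n)

Sigma_hat = X.T @ X / n                    # sample covariance (assumed normalization)
s, V = np.linalg.eigh(Sigma_hat)           # eigenvalues s (ascending), eigenvectors V
m = 0.5                                    # decay exponent of the schedule
etas = 0.9 / s.max() / np.arange(1, T + 1) ** m

# Full-batch gradient descent on ||y - X beta||^2 / (2n), started at zero.
beta = np.zeros(p)
for eta in etas:
    beta -= eta * (X.T @ (X @ beta - y) / n)

# Closed form: each eigendirection of the least squares solution is
# shrunk by 1 - prod_k (1 - eta_k * s_i).
beta_ls = np.linalg.pinv(X) @ y
residual_factor = np.prod(1.0 - np.outer(etas, s), axis=0)
beta_closed = V @ ((1.0 - residual_factor) * (V.T @ beta_ls))

# Generalized ridge with penalty 0.5 * b^T Lam b, where Lam shares the
# eigenvectors V and its eigenvalues are chosen to match the shrinkage.
lam = s * residual_factor / (1.0 - residual_factor)
Lam = V @ np.diag(lam) @ V.T
beta_ridge = np.linalg.solve(Sigma_hat + Lam, X.T @ y / n)

print(np.max(np.abs(beta - beta_closed)), np.max(np.abs(beta - beta_ridge)))
```

Both printed differences should be at the level of floating-point error. The per-direction penalties `lam` depend on the whole schedule and on $T$, which is why the equivalent penalty is a generalized ridge penalty and not a single scalar multiple of the identity.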

Paper Structure (20 sections, 21 theorems, 121 equations, 1 figure)

Introduction
Other Related Works
Generalization error and regularization.
Gradient flow dynamics.
SGD learning rates.
Other related works.
Setup and Preliminaries
Contributions and Organization
Exact Trajectories
Early Stopping and Generalized Ridge Regularization
When to Stop Training?
Early Stopped Risk
When is Early Stopping Beneficial?
Optimal Stopping Time
Experimental Validation
...and 5 more sections

Key Result

Theorem 1

Let $X = U\Sigma_X V^T$ be any training data matrix. If $\beta_k$ is the parameter after $k$ steps of gradient descent from any initialization $\beta_0$ with arbitrary learning rate schedule $\eta_k$ and regularization parameter $\lambda$, then we have that [...]. If we let $\varepsilon = y - X\beta_*$ be the residual, then [...]. Here, the vectors $\tilde{\beta}_k = V^T \beta_k$ and $\tilde{\beta}_* = V^T \be

Figures (1)

Figure 1: Excess risk curves and stopping times for different values of $p$, $n$, $\tau$, and $\alpha$. The learning rate schedule is $\eta_k = 0.9\lambda_{max}(\hat{\Sigma})^{-1}/k^m$. From left to right, $m = 0, 1/4, 1/2, 3/4$. The estimated stopping time is shown in purple and the true stopping time in green.
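The schedule in the caption can be sketched in a few lines, together with an empirical stopping time read off an excess risk curve. This is a rough illustration, not a reproduction of the figure: the data model (Gaussian covariates, unit noise), the sizes, the step index starting at $k = 1$, the loss $\|y - X\beta\|^2/(2n)$, and the use of $\|\beta_k - \beta_*\|^2$ as the excess risk (which assumes isotropic test covariates) are all assumptions, and `lr_schedule` is a hypothetical helper name. The paper's estimated stopping time is not part of this excerpt and is not computed here.

```python
import numpy as np

def lr_schedule(Sigma_hat, T, m):
    """Return eta_k = 0.9 / lambda_max(Sigma_hat) / k**m for k = 1..T."""
    lam_max = np.linalg.eigvalsh(Sigma_hat)[-1]   # eigvalsh sorts ascending
    k = np.arange(1, T + 1)
    return 0.9 / lam_max / k ** m

rng = np.random.default_rng(1)
n, p, T = 40, 30, 300                             # hypothetical sizes and horizon
X = rng.standard_normal((n, p))
beta_star = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta_star + rng.standard_normal(n)
Sigma_hat = X.T @ X / n

stops = {}
for m in (0.0, 0.25, 0.5, 0.75):
    beta = np.zeros(p)
    risks = []
    for eta in lr_schedule(Sigma_hat, T, m):
        beta -= eta * (X.T @ (X @ beta - y) / n)  # full-batch gradient step
        # Excess risk under isotropic test covariates (assumption).
        risks.append(np.sum((beta - beta_star) ** 2))
    stops[m] = int(np.argmin(risks)) + 1          # empirical best stopping step

print(stops)
```

Slower-decaying schedules (smaller $m$) move through the trajectory faster, so the empirical best stopping step generally shifts as $m$ changes.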

Theorems & Definitions (39)

Definition 1
Theorem 1: Trajectory
Remark 1
Theorem 2: Early Stopping $\Rightarrow$ Generalized Ridge Regression
Theorem 3: Equivalence of Early Stopping and Generalized Ridge Regularization, Part 2
Theorem 4: Risk
Theorem 5: Early Stopping
Remark 2
Theorem 6: Early Stopping Converse
Theorem 6: Trajectory
...and 29 more
