On Regularization via Early Stopping for Least Squares Regression

Rishi Sonthalia, Jackie Lok, Elizaveta Rebrova

TL;DR

This paper analyzes the dynamics of discrete full batch gradient descent for linear regression and shows that when training with a learning rate schedule and a finite time horizon, the early stopped solution is equivalent to the minimum norm solution for a generalized ridge regularized problem.

Abstract

A fundamental problem in machine learning is understanding the effect of early stopping on the parameters obtained and the generalization capabilities of the model. Even for linear models, the effect is not fully understood for arbitrary learning rates and data. In this paper, we analyze the dynamics of discrete full batch gradient descent for linear regression. With minimal assumptions, we characterize the trajectory of the parameters and the expected excess risk. Using this characterization, we show that when training with a learning rate schedule $\eta_k$, and a finite time horizon $T$, the early stopped solution $\beta_T$ is equivalent to the minimum norm solution for a generalized ridge regularized problem. We also prove that early stopping is beneficial for generic data with arbitrary spectrum and for a wide variety of learning rate schedules. We provide an estimate for the optimal stopping time and empirically demonstrate the accuracy of our estimate.
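
To make the claimed equivalence concrete, here is a minimal NumPy sketch for a simplified setting: zero initialization, no explicit penalty, and a design matrix of full column rank (so the regularized solution is unique). It is not the paper's construction. It uses the standard spectral-filter argument, and the function names `gd_early_stopped`, `matched_ridge_penalties`, and `generalized_ridge`, the problem sizes, and the schedule are all assumptions made for illustration. In the singular basis of $X$, $T$ steps of gradient descent on $\frac{1}{2}\|y - X\beta\|^2$ scale the least squares coefficient in direction $i$ by $1 - \prod_{k=1}^{T}(1 - \eta_k \sigma_i^2)$, while a ridge penalty $\lambda_i$ on that direction scales it by $\sigma_i^2/(\sigma_i^2 + \lambda_i)$. Equating the two factors gives the per-direction penalties computed below.

```python
import numpy as np

def gd_early_stopped(X, y, etas):
    """Run full-batch gradient descent on 0.5 * ||y - X b||^2 from b = 0."""
    beta = np.zeros(X.shape[1])
    for eta in etas:
        beta = beta - eta * (X.T @ (X @ beta - y))
    return beta

def matched_ridge_penalties(sing_vals, etas):
    """Per-direction penalties whose ridge shrinkage equals the GD shrinkage."""
    s2 = sing_vals ** 2
    # Residual factor prod_k (1 - eta_k * sigma_i^2) left after the given steps.
    resid = np.prod(1.0 - np.outer(etas, s2), axis=0)
    return s2 * resid / (1.0 - resid)

def generalized_ridge(X, y, V, lams):
    """Minimize 0.5 * ||y - X b||^2 + 0.5 * b^T (V diag(lams) V^T) b."""
    penalty = V @ np.diag(lams) @ V.T
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

# Synthetic data (hypothetical sizes, chosen only for the demonstration).
rng = np.random.default_rng(0)
n, p, T = 50, 5, 20
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.5 * rng.standard_normal(n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
# Assumed schedule: eta_k = 0.9 / (sigma_max^2 * sqrt(k)), k = 1..T.
etas = 0.9 / (s[0] ** 2 * np.sqrt(np.arange(1, T + 1)))

beta_T = gd_early_stopped(X, y, etas)
lams = matched_ridge_penalties(s, etas)
beta_ridge = generalized_ridge(X, y, Vt.T, lams)

# Largest coordinate-wise difference between the two solutions.
max_gap = np.max(np.abs(beta_T - beta_ridge))
print(max_gap)
```

The penalty matrix $V \operatorname{diag}(\lambda_i) V^T$ is not a multiple of the identity in general, which is why the matching problem is a generalized ridge problem rather than ordinary ridge regression. Directions with larger singular values converge faster and therefore receive smaller penalties.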

Paper Structure

This paper contains 20 sections, 21 theorems, 121 equations, and 1 figure.

Key Result

Theorem 1

Let $X = U\Sigma_X V^T$ be any training data matrix. If $\beta_k$ is the parameter after $k$ steps of gradient descent from any initialization $\beta_0$ with arbitrary learning rate schedule $\eta_k$ and regularization parameter $\lambda$, then we have that If we let $\varepsilon = y - X\beta_*$ be the residual then Here, the vectors $\tilde{\beta}_k = V^T \beta_k$ and $\tilde{\beta}_* = V^T \be
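
The statement above is truncated, so the paper's exact formulas are not reproduced here. As a hedged illustration of why rotating by $V^T$ is useful, the sketch below assumes the objective $\frac{1}{2}\|y - X\beta\|^2 + \frac{\lambda}{2}\|\beta\|^2$ with $\lambda > 0$ (the paper's scaling may differ). Under that assumption the gradient step in the rotated coordinates $\tilde{\beta}_k = V^T\beta_k$ decouples, and each coordinate contracts toward its own fixed point $c_i = \sigma_i (U^T y)_i / (\sigma_i^2 + \lambda)$ by the factor $\prod_{j=1}^{k}\bigl(1 - \eta_j(\sigma_i^2 + \lambda)\bigr)$. The names `gd_trajectory` and `closed_form_rotated` are hypothetical.

```python
import numpy as np

def gd_trajectory(X, y, beta0, etas, lam):
    """Iterate b <- b - eta_k * (X^T (X b - y) + lam * b); return all iterates."""
    betas = [beta0]
    for eta in etas:
        b = betas[-1]
        betas.append(b - eta * (X.T @ (X @ b - y) + lam * b))
    return np.array(betas)

def closed_form_rotated(X, y, beta0, etas, lam):
    """Closed form for V^T beta_k under the assumed objective (needs lam > 0)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=True)
    p = X.shape[1]
    r = len(s)                            # r = min(n, p)
    sig = np.zeros(p)
    sig[:r] = s                           # pad with zeros when p > n
    uty = np.zeros(p)
    uty[:r] = (U.T @ y)[:r]
    fixed = sig * uty / (sig ** 2 + lam)  # per-coordinate fixed point c_i
    # Cumulative contraction factors, one row per step count k = 0..T.
    factors = np.cumprod(1.0 - np.outer(etas, sig ** 2 + lam), axis=0)
    factors = np.vstack([np.ones(p), factors])
    return fixed + factors * (Vt @ beta0 - fixed), Vt

# Overparameterized synthetic example (hypothetical sizes).
rng = np.random.default_rng(1)
n, p, T, lam = 8, 12, 30, 0.1
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta0 = rng.standard_normal(p)

smoothness = np.linalg.norm(X, 2) ** 2 + lam   # largest Hessian eigenvalue
etas = 0.9 / (smoothness * np.arange(1, T + 1) ** 0.25)

betas = gd_trajectory(X, y, beta0, etas, lam)
rotated, Vt = closed_form_rotated(X, y, beta0, etas, lam)

# Row k of betas @ Vt.T is V^T beta_k; compare against the closed form.
traj_gap = np.max(np.abs(betas @ Vt.T - rotated))
print(traj_gap)
```

Because the check holds for an arbitrary initialization and a decaying schedule, it shows that the decoupling depends only on the quadratic structure of the objective, not on a particular step size rule.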

Figures (1)

  • Figure 1: Excess risk curves and stopping times for different $p, n, \tau$, and $\alpha$. The learning rate schedule is $\eta_k = 0.9\lambda_{\max}(\hat{\Sigma})^{-1}/k^m$. From left to right, we have $m = 0, 1/4, 1/2, 3/4$. The estimated stopping time is in purple, and the true stopping time is in green.
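
The schedule in the caption can be sketched in a few lines of NumPy. This is a rough illustration, not the authors' experimental code: it assumes $\hat{\Sigma} = X^T X / n$, Gaussian synthetic data, zero initialization, and an isotropic test distribution so that the excess risk reduces to $\|\beta_k - \beta_*\|^2$. The names `caption_schedule`, `excess_risk_curve`, and `noise_std`, and all problem sizes, are hypothetical and are not tied to the paper's $\tau$ or $\alpha$.

```python
import numpy as np

def caption_schedule(X, T, m):
    """eta_k = 0.9 / (lambda_max(Sigma_hat) * k^m) for k = 1..T."""
    sigma_hat = X.T @ X / X.shape[0]      # assumed definition of Sigma_hat
    lam_max = np.linalg.eigvalsh(sigma_hat)[-1]
    k = np.arange(1, T + 1)
    return 0.9 / (lam_max * k ** m)

def excess_risk_curve(X, y, beta_star, etas):
    """Parameter error ||beta_k - beta_*||^2 after each step, k = 0..T."""
    n, p = X.shape
    beta = np.zeros(p)
    risks = [np.sum((beta - beta_star) ** 2)]
    for eta in etas:
        # Gradient of (1 / (2n)) * ||y - X beta||^2.
        beta = beta - eta * (X.T @ (X @ beta - y)) / n
        risks.append(np.sum((beta - beta_star) ** 2))
    return np.array(risks)

# Synthetic problem (hypothetical sizes and noise level).
rng = np.random.default_rng(0)
n, p, T, noise_std = 100, 50, 500, 1.0
X = rng.standard_normal((n, p))
beta_star = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta_star + noise_std * rng.standard_normal(n)

# Empirical best stopping time for each decay exponent in the caption.
stopping_times = {}
for m in (0, 0.25, 0.5, 0.75):
    risks = excess_risk_curve(X, y, beta_star, caption_schedule(X, T, m))
    stopping_times[m] = int(np.argmin(risks))
    print(m, stopping_times[m], risks[stopping_times[m]], risks[-1])
```

The `argmin` here plays the role of the true stopping time in the figure; it requires knowing $\beta_*$, which is why an estimate computable from data is needed in practice.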

Theorems & Definitions (39)

  • Definition 1
  • Theorem 1: Trajectory
  • Remark 1
  • Theorem 2: Early Stopping $\Rightarrow$ Generalized Ridge Regression
  • Theorem 3: Equivalence of Early Stopping and Generalized Ridge Regularization, Part 2
  • Theorem 4: Risk
  • Theorem 5: Early Stopping
  • Remark 2
  • Theorem 6: Early Stopping Converse
  • Theorem 6: Trajectory
  • ...and 29 more