Understanding the Generalization Benefits of Late Learning Rate Decay

Yinuo Ren; Chao Ma; Lexing Ying

TL;DR

It is demonstrated that an extended phase with a large learning rate steers the model towards the minimum norm solution of the training loss, which may achieve near-optimal generalization, thereby affirming the empirically observed benefits of late learning rate decay.

Abstract

Why do neural networks trained with large learning rates for a longer time often lead to better generalization? In this paper, we delve into this question by examining the relation between training and testing loss in neural networks. Through visualization of these losses, we note that the training trajectory with a large learning rate navigates through the minima manifold of the training loss, finally nearing the neighborhood of the testing loss minimum. Motivated by these findings, we introduce a nonlinear model whose loss landscapes mirror those observed for real neural networks. Upon investigating the training process using SGD on our model, we demonstrate that an extended phase with a large learning rate steers our model towards the minimum norm solution of the training loss, which may achieve near-optimal generalization, thereby affirming the empirically observed benefits of late learning rate decay.
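As a rough illustration of what "late learning rate decay" means operationally, the sketch below writes the schedule as a piecewise-constant function of the epoch. The values 0.1 and 0.01 mirror the learning rates reported in the Figure 1 caption; the function name `late_decay_lr` and the parameter `decay_epoch` are hypothetical and chosen only for this example, not taken from the paper.

```python
def late_decay_lr(epoch, decay_epoch, lr_large=0.1, lr_small=0.01):
    """Piecewise-constant schedule: large rate before decay_epoch, small rate after.

    Hypothetical helper for illustration; a later decay_epoch means a longer
    large-learning-rate phase before the decay.
    """
    if epoch < decay_epoch:
        return lr_large
    return lr_small


# Two schedules over 10 epochs that differ only in when the decay happens.
early = [late_decay_lr(e, decay_epoch=2) for e in range(10)]
late = [late_decay_lr(e, decay_epoch=8) for e in range(10)]
```

Comparing runs that share the same large-rate path but branch off at different values of `decay_epoch` corresponds to the main path and subpaths described in the Figure 1 caption.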

Paper Structure (23 sections, 14 theorems, 136 equations, 4 figures)

INTRODUCTION
Contribution
Related Works
MOTIVATING EMPIRICAL OBSERVATIONS
Observations from Training Behaviors
Visualization of Training and Testing Landscape
Intuitive Cause of the Landscape Structure
AN ILLUSTRATIVE MODEL
Model Settings
Motivation
Main Results
Phase I
Phase II
SGD with Label Noise.
Stationary Distribution along the Normal Space.
...and 8 more sections

Key Result

Theorem 3.1

For any initialization ${\bm{w}}(0)={\bm{w}}_0$, under the dynamics in Equation eq:1_sde, the dynamics of the projection of ${\bm{w}}(t)$ onto the column space of ${\bm{X}}$, denoted by ${\bm{w}}_{\bm{X}}(t)$, have the exponential mixing property, i.e., for any two initializations ${\bm{w}}_0$ and … where $P_0^t({\bm{w}}_0, \cdot)$ is the distribution of ${\bm{w}}_{\bm{X}}(t)$ starting from ${\bm{w}}_0$ …

Figures (4)

Figure 1: Behaviors of the training and testing losses for a VGG-11 model trained on the CIFAR-10 dataset under various learning rate schedules. Panel (a) showcases the learning curves of the main path with a learning rate of 0.1. In panels (b) and (c), the $y$-axis represents the number of epochs before the learning rate decay, and the $x$-axis indicates the number of epochs after the decay. Each slice parallel to the $x$-axis illustrates the learning curve of a subpath originating from the same main path as shown in (a) with a learning rate of 0.01.
Figure 2: Visualization of the training and testing loss landscapes for a VGG-11 model trained on the CIFAR-10 dataset. The main path with the initial learning rate of 0.1 is represented by an orange line, while the subpaths with the reduced learning rate of 0.01 are depicted in blue dashed lines. The final point of the most extended training trajectory that spans 2000 epochs is marked with a red star.
Figure 3: Comparison of training and testing loss landscapes of the linear regression model $y = {\bm{w}}^\top {\bm{x}}$ (Wu et al., 2020), the linear diagonal network model $y = ({\bm{w}}^{\odot 3})^\top {\bm{x}}$ (Gunasekar et al., 2018), and our reparametrization model $y = \|{\bm{w}}\|^2 {\bm{w}}^\top {\bm{x}}$. In this example, we choose $d=2$, $n=1$, ${\bm{w}}^* = (-1, 0.5)^\top$ and ${\bm{X}} = (0.15, -0.7)^\top$.
Figure 4: Illustration of the normal space ${\mathcal{N}}({\bm{w}}_{\mathcal{M}};{\mathcal{M}})$ (highlighted in red) and the tangent space ${\mathcal{T}}({\bm{w}}_{\mathcal{M}};{\mathcal{M}})$ (highlighted in blue) of the manifold ${\mathcal{M}}$ around a point ${\bm{w}}_{\mathcal{M}}$ on ${\mathcal{M}}$. The gradient flow trajectory starting from ${\bm{w}}(0)$ is represented by a black curve. Due to the stochastic gradient, the dynamics of ${\bm{w}}(t)$ do not exactly follow the gradient flow but still enter the neighborhood of ${\mathcal{M}}$, depicted as the region between the two dashed lines in Phase I. During Phase II, we focus on the effective dynamics of ${\bm{w}}(t)$ along the minima manifold ${\mathcal{M}}$, denoted by ${\bm{w}}_{\mathcal{M}}(t)$. The time scale separation of the dynamics in the normal and tangent spaces allows a quasistatic approach for the analysis of ${\bm{w}}_{\mathcal{M}}(t)$.
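The three models compared in Figure 3 can be evaluated numerically with the caption's settings ($d=2$, $n=1$, ${\bm{w}}^* = (-1, 0.5)^\top$, ${\bm{X}} = (0.15, -0.7)^\top$). The following is a minimal sketch, not the authors' code: it assumes a squared loss and a noiseless label generated by ${\bm{w}}^*$ under each model, and the helper names are hypothetical. For the reparametrization model it also computes the minimum norm zero-loss point, which for a single sample is parallel to ${\bm{x}}$ with radius $r$ solving $r^3 \|{\bm{x}}\| = |y|$.

```python
import numpy as np

# Settings quoted from the Figure 3 caption (d = 2, n = 1).
w_star = np.array([-1.0, 0.5])
x = np.array([0.15, -0.7])


def linear(w, x):
    # Linear regression model: y = w^T x
    return w @ x


def diagonal(w, x):
    # Linear diagonal network model: y = (w^{elementwise 3})^T x
    return (w ** 3) @ x


def reparam(w, x):
    # Reparametrization model: y = ||w||^2 w^T x
    return (w @ w) * (w @ x)


def train_loss(model, w, x=x, w_star=w_star):
    # Assumption: squared loss against a noiseless label produced by w_star.
    y = model(w_star, x)
    return 0.5 * (model(w, x) - y) ** 2


def min_norm_reparam(x=x, w_star=w_star):
    # For fixed ||w|| = r, |w^T x| is largest when w is parallel to x,
    # so the smallest feasible radius satisfies r^3 ||x|| = |y|.
    y = reparam(w_star, x)
    x_norm = np.linalg.norm(x)
    r = (abs(y) / x_norm) ** (1.0 / 3.0)
    return np.sign(y) * r * x / x_norm


w_min = min_norm_reparam()
```

Under these assumptions, `w_min` lies on the zero-loss set of the reparametrization model and has a smaller norm than `w_star`, which gives a concrete instance of the "minimum norm solution of the training loss" referred to in the abstract.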

Theorems & Definitions (33)

Theorem 3.1
Remark 3.2
Lemma 3.3
Theorem 3.4
Remark 3.5
Theorem 3.6
Lemma A.1
proof
Corollary A.2
proof
...and 23 more
