A Minimalist Example of Edge-of-Stability and Progressive Sharpening

Liming Liu, Zixuan Zhang, Simon Du, Tuo Zhao

TL;DR

The paper builds a minimal two-layer, width-one network with a two-dimensional input to rigorously analyze Edge of Stability and Progressive Sharpening under large learning rates. It develops a nonasymptotic GD framework that exhibits three phases—progressive sharpening, edge-of-stability with PS and self-stabilization—and proves global convergence with a sharpness bound $S(\theta)\le (2+\delta)/\eta$. By decomposing the loss into oscillatory and convergent parts and introducing surrogate losses $\widehat{L}$, it provides explicit decay rates and links to gradient-flow and constrained-trajectory theories. The work also demonstrates how input-data distribution (one relevant, one irrelevant feature) and a well-behaved stable set reconcile minimalist and generalist analyses, offering insights into when large learning rates help or hinder optimization in practice. Overall, the results illuminate how EoS phenomena arise from parameter and input-distribution interactions and suggest principled ways to harness large learning rates in deep learning settings.

Abstract

Recent advances in deep learning optimization have unveiled two intriguing phenomena under large learning rates: Edge of Stability (EoS) and Progressive Sharpening (PS), challenging classical Gradient Descent (GD) analyses. Current research approaches, using either generalist frameworks or minimalist examples, face significant limitations in explaining these phenomena. This paper advances the minimalist approach by introducing a two-layer network with a two-dimensional input, where one dimension is relevant to the response and the other is irrelevant. Through this model, we rigorously prove the existence of progressive sharpening and self-stabilization under large learning rates, and establish a non-asymptotic analysis of the training dynamics and sharpness along the entire GD trajectory. In addition, we connect our minimalist example to existing works by reconciling the existence of a well-behaved "stable set" between minimalist and generalist analyses, and by extending the analysis of Gradient Flow Solution sharpness to our two-dimensional input scenario. These findings provide new insights into the EoS phenomenon from both parameter and input data distribution perspectives, potentially informing more effective optimization strategies in deep learning practice.
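For context on why sharpness near $2/\eta$ is called the "edge" of stability, the following minimal sketch runs GD on a one-dimensional quadratic, where the classical analysis is exact: the iterates contract when the curvature is below $2/\eta$ and blow up when it is above. The function name `run_gd_on_quadratic` and the quadratic itself are illustrative assumptions and are not the paper's model; only the learning rate $\eta = 1/20$ is borrowed from the Figure 1 caption.

```python
def run_gd_on_quadratic(curvature, lr, steps=50, x0=1.0):
    """Run GD on f(x) = 0.5 * curvature * x**2 and return |x| after each step.

    Illustrative toy problem only (not the network studied in the paper).
    """
    x = x0
    history = []
    for _ in range(steps):
        grad = curvature * x   # f'(x) = curvature * x
        x = x - lr * grad      # equivalent to x <- (1 - lr * curvature) * x
        history.append(abs(x))
    return history


lr = 1 / 20            # learning rate quoted in the Figure 1 caption
threshold = 2 / lr     # classical stability threshold 2 / eta

below = run_gd_on_quadratic(0.9 * threshold, lr)  # curvature just below 2 / eta
above = run_gd_on_quadratic(1.1 * threshold, lr)  # curvature just above 2 / eta

print("final |x| below threshold:", below[-1])  # shrinks towards zero
print("final |x| above threshold:", above[-1])  # grows without bound
```

On a quadratic the per-step multiplier is $1 - \eta \cdot \text{curvature}$, so the two runs differ only in whether its magnitude is below or above one. The EoS setting is interesting precisely because a non-quadratic loss can hover near this threshold without diverging.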

Paper Structure

This paper contains 31 sections, 12 theorems, 158 equations, 15 figures.

Key Result

Theorem 4.1

For any $\delta > 0$ and $\epsilon > 0$, there exists a time $T(\delta, \epsilon)$, such that for any $t \geq T(\delta, \epsilon)$, we have

Figures (15)

  • Figure 1: Set $\lambda_1 = 100$ and $\lambda_2 = 0.01$. We train our model with learning rate $\eta=1/20$ for 10000 iterations.
  • Figure 2: Same setting as Figure 1. In the left figure we plot $L(\theta)$, $L_2(\theta)$ and $\widehat{L}(\theta)$ in log scale. We can see that $\widehat{L}(\theta)$ closely tracks the decay rate of $L_2(\theta)$. The slope of the red dashed line is $2\log(1-\frac{2\lambda_2}{\lambda_1})$, which matches the decay rate of $\widehat{L}(\theta)$. In the right figure we plot $L_1(\theta)$ and $L_2(\theta)$. Most of the time $L_1(\theta)$ is near zero, except when spikes occur.
  • Figure 3: We choose the learning rate $\eta = \frac{1}{12}$ and show the evolution of the upper and lower bounds in Theorem \ref{gfb}.
  • Figure 4: Same setting as Figure 1. We plot the GFs starting from different points on the GD trajectory and the minimizers these GFs converge to.
  • Figure 5: Same setting as Figure 1. We visualize the GD trajectory with learning rate $\eta = \frac{1}{20}$, as well as the GF and constrained trajectory starting from the same initialization.
  • ...and 10 more figures

Theorems & Definitions (17)

  • Definition 3.1: Sharpness of $L(\theta)$
  • Theorem 4.1: Global Convergence
  • Lemma 4.2
  • Theorem 4.3: Progressive Sharpening
  • Theorem 4.4: Edge of Stability
  • Lemma 4.5
  • Theorem 4.6
  • Lemma 5.1
  • Theorem 5.2
  • Remark 5.3
  • ...and 7 more
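
To make the sharpness bound $S(\theta)\le (2+\delta)/\eta$ concrete, here is a hedged sketch of how sharpness could be evaluated on a toy version of the setting. Several things are assumptions and are not taken from the paper: that sharpness means the largest eigenvalue of the loss Hessian (the usual convention, which this listing does not spell out), that the network is linear, $f(x) = a\,(w_1 x_1 + w_2 x_2)$, and that the loss has the hypothetical form in `loss` below, with the first feature irrelevant and the second relevant. Only $\lambda_1 = 100$, $\lambda_2 = 0.01$ and $\eta = 1/20$ come from the Figure 1 caption.

```python
import numpy as np

LAMBDA_1, LAMBDA_2 = 100.0, 0.01  # values quoted in the Figure 1 caption
ETA = 1 / 20                      # learning rate quoted in the Figure 1 caption


def loss(theta):
    """Hypothetical loss (an assumption, NOT the paper's definition)."""
    a, w1, w2 = theta
    return 0.5 * LAMBDA_1 * (a * w1) ** 2 + 0.5 * LAMBDA_2 * (a * w2 - 1.0) ** 2


def hessian(theta):
    """Analytic Hessian of the hypothetical loss with respect to (a, w1, w2)."""
    a, w1, w2 = theta
    p1, p2 = a * w1, a * w2 - 1.0           # residuals of the two terms
    g1 = np.array([w1, a, 0.0])             # gradient of p1
    g2 = np.array([w2, 0.0, a])             # gradient of p2
    c1 = np.zeros((3, 3))
    c1[0, 1] = c1[1, 0] = 1.0               # Hessian of p1 = a * w1
    c2 = np.zeros((3, 3))
    c2[0, 2] = c2[2, 0] = 1.0               # Hessian of p2 = a * w2 - 1
    return (LAMBDA_1 * (np.outer(g1, g1) + p1 * c1)
            + LAMBDA_2 * (np.outer(g2, g2) + p2 * c2))


def sharpness(theta):
    """Largest Hessian eigenvalue (assumed meaning of 'sharpness')."""
    return float(np.linalg.eigvalsh(hessian(theta))[-1])


# Two global minimizers of the hypothetical loss: w1 = 0 and a * w2 = 1.
sharp_min = np.array([1.0, 0.0, 1.0])
flat_min = np.array([0.5, 0.0, 2.0])

for name, theta in [("sharp", sharp_min), ("flat", flat_min)]:
    print(name, "loss:", loss(theta), "sharpness:", sharpness(theta),
          "2/eta:", 2 / ETA)
```

Under these assumptions both points have zero loss, but only the second has sharpness below $2/\eta$. This is the kind of distinction a sharpness bound along the GD trajectory is about: a large learning rate rules out convergence to the sharper minimizer.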