Stepping on the Edge: Curvature Aware Learning Rate Tuners

Vincent Roulet, Atish Agarwala, Jean-Bastien Grill, Grzegorz Swirszcz, Mathieu Blondel, Fabian Pedregosa

TL;DR

The paper investigates how curvature dynamics, particularly the sharpness (the largest eigenvalue of the Hessian $\nabla^2 f$), interact with learning-rate tuners in deep learning. It shows that classical tuners such as linesearch and quadratic-greedy rules underperform fixed learning rates in the full-batch regime: they undershoot the edge of stability (EOS) and destabilize the sharpness, leading to slower long-term progress. To address this, the authors propose Curvature Dynamics Aware Tuning (CDAT), which aims to keep the optimizer near EOS by setting $\text{LR}_t=\sigma\frac{n_t}{d_t}$ with $n_t=\max\{-\nabla f(w_t)^\top u_t,0\}$ and $d_t=|u_t^\top \nabla^2 f(w_t)u_t|+\epsilon$, and smoothing via EMA; the scaling factor $\sigma$ interpolates between greedy and on-edge behavior. Empirically, CDAT matches or outperforms tuned constant learning rates in the full-batch regime and exhibits warm-up-like LR progression that stabilizes sharpness, reducing progressive sharpening. In stochastic mini-batch settings, stochasticity mitigates some EOS feedback, reducing CDAT’s advantages and making batch-size dependent scaling important; a simple theoretical model clarifies the joint dynamics of LR and sharpness and highlights limitations of existing models. Overall, the work argues that leveraging curvature stabilization, rather than greedy loss decrease, yields more reliable long-term training and suggests directions for integrating curvature-aware control into adaptive optimizers.
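To make the rule above concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes the update direction is the negative gradient, uses a hypothetical toy quadratic with a diagonal Hessian (`CURVATURES`), and leaves out the EMA smoothing. The names `cdat_learning_rate`, `step_loss` and `dot` are illustrative only. On a quadratic, a scaling factor of 1 gives the exact minimizer along the update direction (greedy), while a scaling factor of 2 takes twice that step and leaves the loss unchanged (on edge).

```python
def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def cdat_learning_rate(grad, update, hess_vec, sigma=2.0, eps=1e-12):
    """LR = sigma * n / d, with n = max(-g.u, 0) and d = |u.Hu| + eps.

    hess_vec is the Hessian-vector product H u, supplied by the caller.
    """
    n = max(-dot(grad, update), 0.0)
    d = abs(dot(update, hess_vec)) + eps
    return sigma * n / d


# Hypothetical toy objective (an assumption for illustration):
# f(w) = 0.5 * sum_i a_i * w_i**2, whose Hessian is diag(a).
CURVATURES = [1.0, 4.0, 10.0]


def loss(w):
    return 0.5 * sum(a * x * x for a, x in zip(CURVATURES, w))


def gradient(w):
    return [a * x for a, x in zip(CURVATURES, w)]


def step_loss(w, sigma):
    """Loss after one step w + LR * u, with LR set by the rule above."""
    g = gradient(w)
    u = [-x for x in g]                          # assumed update direction: -grad
    hu = [a * x for a, x in zip(CURVATURES, u)]  # H u for the diagonal Hessian
    lr = cdat_learning_rate(g, u, hu, sigma=sigma)
    return loss([x + lr * ux for x, ux in zip(w, u)])


w0 = [1.0, -2.0, 0.5]
greedy_loss = step_loss(w0, sigma=1.0)   # minimizes the quadratic along u
on_edge_loss = step_loss(w0, sigma=2.0)  # twice that step: loss stays the same
```

In a deep learning setting the Hessian-vector product would typically come from automatic differentiation rather than an explicit Hessian; the explicit diagonal form here is only to keep the sketch self-contained.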

Abstract

Curvature information -- particularly, the largest eigenvalue of the loss Hessian, known as the sharpness -- often forms the basis for learning rate tuners. However, recent work has shown that the curvature information undergoes complex dynamics during training, going from a phase of increasing sharpness to eventual stabilization. We analyze the closed-loop feedback effect between learning rate tuning and curvature. We find that classical learning rate tuners may yield greater one-step loss reduction, yet they ultimately underperform in the long term when compared to constant learning rates in the full batch regime. These models break the stabilization of the sharpness, which we explain using a simplified model of the joint dynamics of the learning rate and the curvature. To further investigate these effects, we introduce a new learning rate tuning method, Curvature Dynamics Aware Tuning (CDAT), which prioritizes long term curvature stabilization over instantaneous progress on the objective. In the full batch regime, CDAT shows behavior akin to prefixed warm-up schedules on deep learning objectives, outperforming tuned constant learning rates. In the mini batch regime, we observe that stochasticity introduces confounding effects that explain the previous success of some learning rate tuners at appropriate batch sizes. Our findings highlight the critical role of understanding the joint dynamics of the learning rate and curvature, beyond greedy minimization, to diagnose failures and design effective adaptive learning rate tuners.

Paper Structure (72 sections, 23 equations, 21 figures)


Figures (21)

  • Figure 1: Simple learning rate tuners qualitatively underperform their constant learning rate counterparts. Gradient descent or RMSProp with a tuned constant learning rate versus self-tuned gradient descent by a linesearch method (\ref{eq:linesearch}) or a quadratically greedy rule (\ref{eq:quadratically_greedy}), on various datasets, architectures and losses in a full batch regime. The linesearch may perform better at early times but stalls in the long term.
  • Figure 2: Classical learning rate tuners can be effective on linear models.
  • Figure 3: Classical learning rate tuners can undershoot the edge of stability. Evolution of the learning rate, the sharpness, their product, and the gradient norm for a constant learning rate and for learning rate tuners, in full batch gradient descent. The learning rate decreases by 3 orders of magnitude for tuners (1st panel) while the sharpness increases (2nd panel). Their product remains relatively steady, just below the edge of stability (3rd panel). The gradient norm increases by less than a factor of 10, consistent with slow training at late times (4th panel).
  • Figure 4: The poor performance of classical learning rate tuners, understood in a simplified model. The dynamics of the learning rate $\eta$, the sharpness $\lambda_{\max}$, and the normalized centered sharpness $y = \eta\lambda_{\max}-2$ are examined in the simplified model (\ref{eq:x_l_theory}). With a constant $\eta$, $\lambda_{\max}$ stabilizes and $y$ oscillates around $0$ (blue). Classical learning rate tuners often quickly equilibrate around $y_{t} = -\epsilon$, which we model using $\eta = 1.9/\lambda_{\max}$, so that $y = -0.1$ (orange). This equilibration of $y$ away from zero prevents stabilization in $\lambda_{\max}$, leading to an increase in $\lambda_{\max}$, and a corresponding decrease in $\eta$.
  • Figure 5: Forcing optimizers to stay on edge ($\sigma=2.0$) improves performance over the greedy approximation ($\sigma=1.0$). Train loss and learning rate behaviors for fine-tuned optimizers vs self-tuned counterparts with CDAT on various datasets, architectures, and losses in a full batch regime. Tuning the learning rate "on edge" ($\sigma\approx 2$) improves performance over greedy tuning ($\sigma= 1$) as well as over a constant learning rate.
  • ...and 16 more figures