Stepping on the Edge: Curvature Aware Learning Rate Tuners
Vincent Roulet, Atish Agarwala, Jean-Bastien Grill, Grzegorz Swirszcz, Mathieu Blondel, Fabian Pedregosa
TL;DR
The paper investigates how curvature dynamics, particularly the sharpness (the largest eigenvalue of the loss Hessian $\nabla^2 f$), interact with learning-rate tuners in deep learning. It shows that classical tuners such as linesearch and quadratic-greedy rules underperform fixed learning rates in the full-batch regime: they undershoot the edge of stability (EOS) and destabilize the sharpness, which slows long-term progress. To address this, the authors propose Curvature Dynamics Aware Tuning (CDAT), which aims to keep the optimizer near EOS by setting $\text{LR}_t=\sigma\frac{n_t}{d_t}$ with $n_t=\max\{-\nabla f(w_t)^\top u_t, 0\}$ and $d_t=|u_t^\top \nabla^2 f(w_t) u_t|+\epsilon$, where $u_t$ is the update direction, and smoothing via EMA; the scaling factor $\sigma$ interpolates between greedy and on-edge behavior. Empirically, CDAT matches or outperforms tuned constant learning rates in the full-batch regime and exhibits a warm-up-like LR progression that stabilizes sharpness, reducing progressive sharpening. In stochastic mini-batch settings, stochasticity mitigates some of the EOS feedback, which reduces CDAT’s advantages and makes batch-size-dependent scaling important. A simple theoretical model clarifies the joint dynamics of LR and sharpness and highlights limitations of existing models. Overall, the work argues that leveraging curvature stabilization, rather than greedy loss decrease, yields more reliable long-term training, and it suggests directions for integrating curvature-aware control into adaptive optimizers.
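To make the role of the scaling factor concrete, here is a minimal sketch of the learning-rate rule above on a toy quadratic $f(w)=\tfrac{1}{2}w^\top A w$. It is an illustration under stated assumptions, not the authors' implementation: the names `cdat_lr` and `cdat_step`, the choice of plain gradient descent for the update direction $u_t=-\nabla f(w_t)$, and the toy matrix are all hypothetical, and the EMA smoothing of $n_t$ and $d_t$ is omitted for brevity.

```python
# Hypothetical sketch of a CDAT-style learning rate on f(w) = 0.5 * w^T A w.
# Pure Python, no external dependencies; EMA smoothing is intentionally omitted.

def matvec(A, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(a, b):
    """Inner product of two equally sized lists."""
    return sum(x * y for x, y in zip(a, b))

def loss(A, w):
    """Quadratic objective f(w) = 0.5 * w^T A w."""
    return 0.5 * dot(w, matvec(A, w))

def cdat_lr(grad, hess_u, u, sigma, eps=1e-12):
    """LR = sigma * n / d with n = max(-grad^T u, 0), d = |u^T H u| + eps."""
    n = max(-dot(grad, u), 0.0)
    d = abs(dot(u, hess_u)) + eps
    return sigma * n / d

def cdat_step(A, w, sigma):
    """One step along u = -grad (assumed direction) with the rule above."""
    grad = matvec(A, w)        # gradient of the quadratic is A w
    u = [-g for g in grad]     # assumed update direction: gradient descent
    hess_u = matvec(A, u)      # Hessian of the quadratic is A, so H u = A u
    lr = cdat_lr(grad, hess_u, u, sigma)
    return [wi + lr * ui for wi, ui in zip(w, u)], lr

A = [[3.0, 0.0], [0.0, 1.0]]
w0 = [1.0, 1.0]
w_greedy, lr_greedy = cdat_step(A, w0, sigma=1.0)  # greedy choice
w_edge, lr_edge = cdat_step(A, w0, sigma=2.0)      # on-edge choice
print(loss(A, w0), loss(A, w_greedy), loss(A, w_edge))
```

On an exact quadratic, the step changes the loss by $(\sigma^2/2-\sigma)\,n_t^2/d_t$ (ignoring $\epsilon$). So $\sigma=1$ gives the largest one-step decrease along $u_t$, while $\sigma=2$ leaves the loss unchanged, which is the boundary beyond which the step would increase it. On deep learning objectives the Hessian changes during training, so this toy case only illustrates the two endpoints of the interpolation.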
Abstract
Curvature information -- particularly, the largest eigenvalue of the loss Hessian, known as the sharpness -- often forms the basis for learning rate tuners. However, recent work has shown that the curvature information undergoes complex dynamics during training, going from a phase of increasing sharpness to eventual stabilization. We analyze the closed-loop feedback effect between learning rate tuning and curvature. We find that classical learning rate tuners may yield greater one-step loss reduction, yet they ultimately underperform in the long term when compared to constant learning rates in the full batch regime. These tuners break the stabilization of the sharpness, which we explain using a simplified model of the joint dynamics of the learning rate and the curvature. To further investigate these effects, we introduce a new learning rate tuning method, Curvature Dynamics Aware Tuning (CDAT), which prioritizes long-term curvature stabilization over instantaneous progress on the objective. In the full batch regime, CDAT shows behavior akin to prefixed warm-up schedules on deep learning objectives, outperforming tuned constant learning rates. In the mini batch regime, we observe that stochasticity introduces confounding effects that explain the previous success of some learning rate tuners at appropriate batch sizes. Our findings highlight the critical role of understanding the joint dynamics of the learning rate and curvature, beyond greedy minimization, to diagnose failures and design effective adaptive learning rate tuners.
