
On the Interplay Between Stepsize Tuning and Progressive Sharpening

Vincent Roulet, Atish Agarwala, Fabian Pedregosa

TL;DR

The paper investigates how automatic stepsize tuners interact with loss sharpness during training, focusing on Armijo line-search and Polyak SPS_max in deterministic and stochastic settings. It finds that Armijo often underperforms fixed stepsize optimization due to progressive sharpening that prevents reaching the edge of stability (EOS), while Polyak stepsizes consistently operate near EOS or slightly beyond and yield faster progress. A simple EOS-based model shows that EOS alone cannot explain the dynamics; the joint evolution of step size and the top Hessian eigenvalue must be accounted for to understand tuning behavior. The stochastic regime reveals strong batch-size dependence, underscoring that tuning effectiveness is intertwined with sampling noise, and suggesting that incorporating sharpness dynamics is essential for designing effective stepsize tuners in deep learning.

Abstract

Recent empirical work has revealed an intriguing property of deep learning models by which the sharpness (largest eigenvalue of the Hessian) increases throughout optimization until it stabilizes around a critical value at which the optimizer operates at the edge of stability, given a fixed stepsize (Cohen et al., 2022). We investigate empirically how the sharpness evolves when using stepsize tuners, namely the Armijo linesearch and Polyak stepsizes, that adapt the stepsize along the iterations to local quantities such as, implicitly, the sharpness itself. We find that the surprisingly poor performance of a classical Armijo linesearch in the deterministic setting may be well explained by its tendency to keep increasing the sharpness of the objective. On the other hand, we observe that Polyak stepsizes generally operate at the edge of stability or even slightly beyond, outperforming their Armijo and constant-stepsize counterparts in the deterministic setting. We conclude with an analysis that suggests unlocking stepsize tuners requires an understanding of the joint dynamics of the step size and the sharpness.
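To make the two stepsize tuners concrete, here is a minimal sketch on a toy diagonal quadratic. It is not the authors' implementation: the Armijo routine is the textbook backtracking rule, the Polyak routine follows the usual SPS_max form min((f(x) - f*) / (c ||grad f(x)||^2), gamma_max), and the function names and the constants `c`, `shrink`, `init`, and `max_step` are illustrative assumptions.

```python
def loss(x, curvatures):
    # Toy objective (assumption): f(x) = 0.5 * sum_i h_i * x_i^2, so f* = 0
    # and the sharpness is max(curvatures).
    return 0.5 * sum(h * xi * xi for h, xi in zip(curvatures, x))


def grad(x, curvatures):
    return [h * xi for h, xi in zip(curvatures, x)]


def sq_norm(v):
    return sum(vi * vi for vi in v)


def armijo_stepsize(x, curvatures, init=1.0, c=1e-4, shrink=0.8,
                    max_backtracks=50):
    """Textbook backtracking: shrink the step until the sufficient-decrease
    condition f(x - step * g) <= f(x) - c * step * ||g||^2 holds."""
    g = grad(x, curvatures)
    f0, g2 = loss(x, curvatures), sq_norm(g)
    step = init
    for _ in range(max_backtracks):
        trial = [xi - step * gi for xi, gi in zip(x, g)]
        if loss(trial, curvatures) <= f0 - c * step * g2:
            break
        step *= shrink
    return step


def polyak_stepsize(x, curvatures, f_star=0.0, c=0.5, max_step=1.0):
    """SPS_max-style step: (f(x) - f*) / (c * ||g||^2), capped at max_step."""
    g2 = sq_norm(grad(x, curvatures))
    if g2 == 0.0:
        return 0.0  # already at a stationary point
    return min((loss(x, curvatures) - f_star) / (c * g2), max_step)


if __name__ == "__main__":
    x, curvatures = [1.0], [4.0]  # one-dimensional quadratic, sharpness 4
    for name, step in [("armijo", armijo_stepsize(x, curvatures)),
                       ("polyak", polyak_stepsize(x, curvatures))]:
        print(name, "step:", step, "step * sharpness:", step * 4.0)
```

On a one-dimensional quadratic with curvature h, the sufficient-decrease condition reduces to step * h <= 2(1 - c), so this Armijo rule always stays strictly below the edge-of-stability value of 2 in that toy setting. The Polyak rule has no such built-in constraint beyond the cap `max_step`.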

Paper Structure

This paper contains 24 sections, 9 equations, and 10 figures.

Figures (10)

  • Figure 1: Interplay between stepsize tuners and sharpness. We plot the training loss (top) and sharpness (bottom) for various architectures on a subset of CIFAR10, for the different stepsize tuners (Armijo-GD, Polyak-GD) as well as GD with fixed stepsizes $\gamma$, all in the full-batch setting. The sharpness of GD stabilizes around the value $2/\gamma$ (dashed line). While Armijo-GD decreases the objective monotonically, its sharpness climbs higher than that of any other method. In contrast, the train loss of Polyak-GD does not decrease monotonically, but its sharpness plateaus at low values.
  • Figure 2: A closer look at the learning rate dynamics. In the top plot, we show the learning rate dynamics for the different stepsize tuners and GD. The Armijo backtracking line-search (red) decreases the learning rate monotonically, while the Polyak stepsize (violet) oscillates around the maximal acceptable value $\gamma_{\max} = 1$ from \ref{eq:polyak}. In the bottom plot, we show the product of the learning rate and the sharpness. For constant stepsize GD, this product stabilizes at the critical EOS value $2$ (black dashed line). For Armijo-GD, the value either stays well below $2$ or lingers around it from the start. For Polyak-GD, the product oscillates around the critical value without stabilizing as it does for GD.
  • Figure 3: Batch sizes impact the behavior of stepsize tuners in the stochastic regime. We plot training loss, sharpness, learning rate, and normalized sharpness for training an MLP (see Appendix for VGG and ResNet) on the full CIFAR10 dataset with various batch sizes. In this stochastic regime, the performance of Armijo varies greatly with the mini-batch size, with good performance at medium scale and poor performance otherwise. Progressive sharpening of Armijo is only observed at medium and large scales. At small scale, Armijo displays an "instantaneous" sharpening that hinders its performance, just as for SGD with large stepsizes. Polyak performs reasonably well in all settings, while potentially operating above the edge of stability, e.g., at medium scale.
  • Figure 4: Training dynamics in stochastic regime for the VGG11 architecture.
  • Figure 5: Training dynamics in stochastic regime for the ResNet34 architecture.
  • ...and 5 more figures
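
The critical value $2$ for the product of learning rate and sharpness in the captions above can be illustrated on a one-dimensional quadratic, where each gradient step multiplies the iterate by $1 - \gamma\lambda$. The sketch below is a toy illustration under that assumption, not an experiment from the paper; `gd_on_quadratic` and its arguments are hypothetical names.

```python
def gd_on_quadratic(sharpness, stepsize, x0=1.0, num_steps=50):
    """Run fixed-stepsize GD on f(x) = 0.5 * sharpness * x^2.

    Returns the loss recorded before each of the num_steps updates.
    """
    x = x0
    losses = []
    for _ in range(num_steps):
        losses.append(0.5 * sharpness * x * x)
        # Gradient of f is sharpness * x, so x <- (1 - stepsize * sharpness) * x.
        x -= stepsize * sharpness * x
    return losses


if __name__ == "__main__":
    sharpness = 4.0
    for stepsize in (0.45, 0.55):  # products 1.8 (below 2) and 2.2 (above 2)
        losses = gd_on_quadratic(sharpness, stepsize)
        print("stepsize * sharpness =", stepsize * sharpness,
              "| first loss:", losses[0], "| last loss:", losses[-1])
```

With the product below $2$ the contraction factor has magnitude less than one and the loss shrinks; above $2$ it exceeds one and the loss grows. On a fixed quadratic the sharpness cannot change, which is why this toy model captures only the stability threshold and not progressive sharpening itself.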