Benefits of Learning Rate Annealing for Tuning-Robustness in Stochastic Optimization
Amit Attia, Tomer Koren
TL;DR
This work investigates learning-rate tuning for SGD in stochastic convex optimization and argues that annealed schedules, particularly cosine and polynomial decay, are robust to the stepsize misspecification that arises from tuning over a coarse grid. Using two quantities derived from the schedule function, $H_h$ and $Q_h$, the authors derive convergence bounds in the non-smooth convex setting that depend sublinearly on the misspecification factor $\rho$ (e.g., $\rho^{1/(2p+1)}$ for polynomial decay of degree $p$ and $\rho^{1/5}$ for cosine), with analogous results in the smooth setting. They provide corollaries for common schedules, numerical insights into the constants, and experimental validation on synthetic logistic regression and on CIFAR-10 with a Wide ResNet, showing improved tuning robustness over fixed-stepsize SGD under coarse grids. The results suggest that annealing can substantially reduce hyperparameter search overhead in large-scale training and can guide the design of more robust learning-rate schedules.
Abstract
The learning rate in stochastic gradient methods is a critical hyperparameter that is notoriously costly to tune via standard grid search, especially for training modern large-scale models with billions of parameters. We identify a theoretical advantage of learning rate annealing schemes that decay the learning rate to zero at a polynomial rate, such as the widely-used cosine schedule, by demonstrating their increased robustness to initial parameter misspecification due to a coarse grid search. We present an analysis in a stochastic convex optimization setup demonstrating that the convergence rate of stochastic gradient descent with annealed schedules depends sublinearly on the multiplicative misspecification factor $\rho$ (i.e., the grid resolution), achieving a rate of $O(\rho^{1/(2p+1)}/\sqrt{T})$, where $p$ is the degree of polynomial decay and $T$ is the number of steps. In contrast, fixed stepsizes yield an $O(\rho/\sqrt{T})$ rate, which depends linearly on $\rho$. Experiments confirm the increased robustness compared to tuning with a fixed stepsize, which has significant implications for the computational overhead of hyperparameter search in practical training scenarios.
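To make the quantities in the abstract concrete, the following is a minimal Python sketch of the schedules and of the $\rho$-dependence of the stated rates. The function names (`polynomial_decay`, `cosine`, `geometric_grid`, `rate_factor`) are hypothetical and chosen for illustration; the schedule formulas are the standard textbook forms and are assumed, not taken from the paper; and `rate_factor` reproduces only the $\rho$ factor of the bounds, ignoring all constants and the $1/\sqrt{T}$ term.

```python
import math

def polynomial_decay(t, T, eta, p=1):
    """Stepsize at step t of T: decays from eta to 0 at polynomial degree p (assumed form)."""
    return eta * (1.0 - t / T) ** p

def cosine(t, T, eta):
    """Standard cosine annealing from eta to 0 over T steps (assumed form)."""
    return 0.5 * eta * (1.0 + math.cos(math.pi * t / T))

def geometric_grid(eta_min, eta_max, rho):
    """Grid of stepsizes with multiplicative resolution rho; some grid point
    is within a factor rho of any target in [eta_min, eta_max]."""
    grid, eta = [], eta_min
    while eta <= eta_max * (1 + 1e-12):  # small tolerance for float rounding
        grid.append(eta)
        eta *= rho
    return grid

def rate_factor(rho, p=None):
    """rho-dependence of the stated rates only (constants omitted):
    rho for a fixed stepsize, rho**(1/(2p+1)) for degree-p polynomial decay."""
    return rho if p is None else rho ** (1.0 / (2 * p + 1))

if __name__ == "__main__":
    rho = 32.0
    print("grid:", geometric_grid(1e-3, 1.0, 10.0))
    print("fixed stepsize factor:", rate_factor(rho))
    print("linear decay (p=1) factor:", rate_factor(rho, p=1))
    print("p=2 factor (the exponent 1/5 quoted for cosine):", rate_factor(rho, p=2))
```

Under these assumptions, a grid with resolution $\rho = 32$ inflates the fixed-stepsize bound by a factor of 32, whereas the $\rho^{1/5}$ dependence quoted for the cosine schedule gives a factor of about 2, up to the constants the sketch omits.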
