Benefits of Learning Rate Annealing for Tuning-Robustness in Stochastic Optimization

Amit Attia, Tomer Koren

TL;DR

This work investigates learning-rate tuning for SGD in stochastic convex optimization and argues that annealed schedules, particularly cosine and polynomial decay, offer robustness to coarse grid misspecification of the stepsize. By defining schedule-function-derived quantities $H_h$ and $Q_h$, the authors derive convergence bounds showing sublinear dependence on the misspecification factor $\rho$ (e.g., $\rho^{1/(2p+1)}$ for polynomial decay and $\rho^{1/5}$ for cosine) in the non-smooth convex setting, with analogous results in the smooth setting. They provide precise corollaries for common schedules, numerical insights into constants, and experimental validation on synthetic logistic regression and CIFAR-10 with Wide ResNet, demonstrating improved tuning robustness over fixed-step SGD under coarse grids. The results imply substantial reductions in hyperparameter search overhead in large-scale training, guiding the design of more robust learning-rate schedules for practical deployment. Overall, the paper advances theoretical understanding and practical applicability of learning-rate annealing as a tool for tuning-robust stochastic optimization.

Abstract

The learning rate in stochastic gradient methods is a critical hyperparameter that is notoriously costly to tune via standard grid search, especially for training modern large-scale models with billions of parameters. We identify a theoretical advantage of learning rate annealing schemes that decay the learning rate to zero at a polynomial rate, such as the widely used cosine schedule, by demonstrating their increased robustness to initial parameter misspecification due to a coarse grid search. We present an analysis in a stochastic convex optimization setup demonstrating that the convergence rate of stochastic gradient descent with annealed schedules depends sublinearly on the multiplicative misspecification factor $\rho$ (i.e., the grid resolution), achieving a rate of $O(\rho^{1/(2p+1)}/\sqrt{T})$ where $p$ is the degree of polynomial decay and $T$ is the number of steps, in contrast to the $O(\rho/\sqrt{T})$ rate that arises with fixed stepsizes and exhibits a linear dependence on $\rho$. Experiments confirm the increased robustness compared to tuning with a fixed stepsize, which has significant implications for the computational overhead of hyperparameter search in practical training scenarios.
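To make the rates above concrete, here is a minimal Python sketch of the three schedule families and of how the misspecification factor enters each bound. The parameterization is an assumption, not taken from this page: stepsizes are written as $\eta_t = \eta \, h(t/T)$ with $h(u) = 1$ (fixed), $h(u) = (1-u)^p$ (polynomial decay) and $h(u) = (1+\cos(\pi u))/2$ (cosine). The function names are hypothetical, and `misspec_factor` only reproduces the $\rho$-dependence quoted in the abstract, ignoring constants.

```python
import math


def fixed(u: float) -> float:
    """Constant schedule: h(u) = 1 for all u in [0, 1]."""
    return 1.0


def poly_decay(u: float, p: int = 1) -> float:
    """Polynomial decay of degree p (assumed form): h(u) = (1 - u)^p."""
    return (1.0 - u) ** p


def cosine(u: float) -> float:
    """Cosine annealing (assumed form): h(u) = (1 + cos(pi * u)) / 2."""
    return 0.5 * (1.0 + math.cos(math.pi * u))


def stepsizes(eta: float, T: int, h) -> list:
    """Stepsizes eta_t = eta * h(t / T) for t = 0, ..., T - 1."""
    return [eta * h(t / T) for t in range(T)]


def misspec_factor(rho: float, p=None) -> float:
    """Rho-dependence of the rate, up to constants.

    p=None models a fixed stepsize (linear in rho);
    an integer p models degree-p decay (rho ** (1 / (2p + 1))).
    """
    if p is None:
        return rho
    return rho ** (1.0 / (2 * p + 1))


if __name__ == "__main__":
    rho = 32.0
    print("fixed stepsize:", misspec_factor(rho))
    print("linear decay (p=1):", misspec_factor(rho, p=1))
    # Cosine decays to zero quadratically near u = 1, hence the rho^{1/5} rate.
    print("cosine (p=2):", misspec_factor(rho, p=2))
```

Under these assumptions, a grid that overshoots the ideal stepsize by a factor of 32 costs a factor of 32 in the fixed-stepsize bound but only about 2 in the cosine bound.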

Paper Structure

This paper contains 35 sections, 17 theorems, 104 equations, 3 figures.

Key Result

lemma 1

Let $\mathcal{X} \subset {\mathbb{R}}^d$ be a convex set with diameter $D > 0$, $f : \mathcal{X} \to {\mathbb{R}}$ a convex function, $x^\star \in \mathop{\mathrm{arg\,min}}\limits_{x \in \mathcal{X}} f(x)$, and $g : \mathcal{X} \to {\mathbb{R}}^d$ an unbiased first-order oracle of $f$ with second-m

Figures (3)

  • Figure 1: Numerically evaluating the coefficient of $DG/\sqrt{T}$ for the convergence guarantee of \ref{thm:main-non-smooth} with different schedules and varying multiplicative misspecification parameter $\rho$.
  • Figure 2: (a) Test loss for the logistic regression task with varying learning rates and different learning rate schedules. Each point represents 3 runs, reporting average and standard deviation. (b) Test loss of the best model in a sub-grid averaged over multiple sub-grids with the same multiplicative grid factor. The smallest multiplicative factor represents the full grid of (a). "Fixed stepsize w/ AVG" stands for fixed stepsize SGD with iterate averaging.
  • Figure 3: (a) CIFAR-10 top-1 test error of WideResNet28-10 with varying learning rates and different learning rate schedules. Each point represents 3 runs, reporting average and standard deviation. (b) Test error of the best model in a sub-grid averaged over multiple sub-grids with the same multiplicative grid factor. The smallest multiplicative factor represents the full grid of (a). "Fixed stepsize w/ AVG" stands for fixed stepsize SGD with polynomial iterate averaging.
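The sub-grid evaluation described in panels (b) of Figures 2 and 3 can be sketched as follows. This is one plausible reading of the captions, not the authors' code: it assumes the full grid is geometric, that a coarser grid is formed by keeping every `stride`-th learning rate, and that the average is taken over all `stride` possible offsets. The function name `best_over_subgrids` is hypothetical.

```python
from statistics import mean


def best_over_subgrids(losses, stride):
    """Average, over all offsets, of the best loss found in each sub-grid.

    losses: test losses on a geometric learning-rate grid, in grid order.
    stride: keep every `stride`-th grid point; if the full grid has ratio r,
            the sub-grid has multiplicative factor r ** stride.
    """
    if stride < 1 or stride > len(losses):
        raise ValueError("stride must be between 1 and len(losses)")
    # One sub-grid per offset; each sub-grid reports its best (lowest) loss.
    best_per_offset = [min(losses[offset::stride]) for offset in range(stride)]
    return mean(best_per_offset)


if __name__ == "__main__":
    # Made-up losses for illustration only, not data from the paper.
    example_losses = [5, 3, 1, 2, 4]
    for k in (1, 2, 3):
        print(k, best_over_subgrids(example_losses, k))
```

With `stride=1` the result is simply the best loss on the full grid, which matches the captions' remark that the smallest multiplicative factor represents the full grid of panel (a).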

Theorems & Definitions (29)

  • lemma 1
  • lemma 2
  • theorem 1
  • corollary 2
  • proof : Proof of \ref{cor:poly-non-smooth}
  • corollary 3
  • lemma 3
  • lemma 4
  • proof : Proof of \ref{thm:main-non-smooth}
  • proof
  • ...and 19 more