No More Pesky Learning Rates

Tom Schaul, Sixin Zhang, Yann LeCun

TL;DR

The paper tackles the persistent challenge of tuning learning rates in SGD by deriving an optimal adaptive rate schedule under a noisy quadratic model and then presenting practical, online approximations. It introduces vSGD, a variance-based approach that can assign per-parameter or block-wise learning rates, with an adaptive memory (time-constant) and curvature estimates via bbprop, enabling automatic adjustment and responsiveness to non-stationary data. Through synthetic and large-scale neural network experiments (MNIST and CIFAR), the method consistently matches or surpasses best-tuned SGD and AdaGrad while requiring no hyper-parameter tuning. The results demonstrate robust, tuning-free optimization that adapts to changing landscapes, potentially making SGD a more user-friendly out-of-the-box optimizer for diverse problems.

Abstract

The performance of stochastic gradient descent (SGD) depends critically on how learning rates are tuned and decreased over time. We propose a method to automatically adjust multiple learning rates so as to minimize the expected error at any one time. The method relies on local gradient variations across samples. In our approach, learning rates can increase as well as decrease, making it suitable for non-stationary problems. Using a number of convex and non-convex learning tasks, we show that the resulting algorithm matches the performance of SGD or other adaptive approaches with their best settings obtained through systematic search, and effectively removes the need for learning rate tuning.
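
The idea of deriving a learning rate from gradient variation across samples can be sketched on a one-dimensional noisy quadratic. The following is a minimal illustration, not the authors' implementation. The update formulas (rate as squared mean gradient over curvature times mean squared gradient, plus an adaptive memory `tau`) are one reading of the variance-based approach and are not given in this section. The function name `vsgd_1d`, the known curvature `h` (the paper estimates curvature with bbprop), and the bootstrap from `n0` initial samples are all assumptions made for illustration.

```python
import random


def vsgd_1d(steps=2000, h=1.0, sigma=1.0, theta0=5.0, n0=10, seed=0):
    """Variance-based adaptive learning rate on a 1-D noisy quadratic.

    Each sample's loss is 0.5 * h * (theta - x)**2 with x ~ N(0, sigma^2),
    so the expected loss is minimized at theta = 0.
    Returns the final theta and the list of learning rates used.
    """
    rng = random.Random(seed)
    theta = theta0

    # Assumption: bootstrap the running averages from a few initial samples.
    grads = [h * (theta - rng.gauss(0.0, sigma)) for _ in range(n0)]
    g_bar = sum(grads) / n0                  # running mean of gradients
    v_bar = sum(g * g for g in grads) / n0   # running mean of squared gradients
    tau = float(n0)                          # memory size (time constant)

    etas = []
    for _ in range(steps):
        x = rng.gauss(0.0, sigma)
        grad = h * (theta - x)

        # Exponential moving averages with memory tau.
        g_bar = (1.0 - 1.0 / tau) * g_bar + grad / tau
        v_bar = (1.0 - 1.0 / tau) * v_bar + grad * grad / tau

        # Rate is large when gradients agree (g_bar^2 close to v_bar)
        # and small when they are dominated by noise.
        eta = g_bar * g_bar / (h * v_bar)

        # Shrink the memory when gradients agree, grow it when they are noisy.
        tau = (1.0 - g_bar * g_bar / v_bar) * tau + 1.0

        theta -= eta * grad
        etas.append(eta)
    return theta, etas


theta_final, etas = vsgd_1d()
```

Because both running averages use the same weights, `g_bar**2` never exceeds `v_bar` (up to rounding), so in this sketch the rate stays between 0 and `1/h` without any manually chosen step size.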

Paper Structure

This paper contains 24 sections, 28 equations, 11 figures, 3 tables, 1 algorithm.

Figures (11)

  • Figure 1: Illustration of the idealized loss function considered (thick magenta), which is the average of the quadratic contributions of each sample (dotted blue), with minima distributed around the point $\theta^*$. Note that the curvatures are assumed to be identical for all samples.
  • Figure 2: Illustration of the dynamics in a noisy quadratic bowl (with 10 times larger curvature in one dimension than the other). Trajectories of 400 steps from vSGD, and from SGD with three different learning rate schedules. SGD with fixed learning rate (crosses) descends until a certain depth (that depends on $\eta$) and then oscillates. SGD with a $1/t$ cooling schedule (pink circles) converges prematurely. On the other hand, vSGD (green triangles) is much less disrupted by the noise and continually approaches the optimum.
  • Figure 3: Optimizing a noisy quadratic loss (dimension $d=1$, curvature $h=1$). Comparison between SGD for two different fixed learning rates 1.0 and 0.2, and two cooling schedules $\eta=1/t$ and $\eta=0.2/t$, and vSGD (red circles). In dashed black, the 'oracle' computes the true optimal learning rate rather than approximating it. In the top subplot, we show the median loss from 1000 simulated runs, and below are corresponding learning rates. We observe that vSGD initially descends as fast as the SGD with the largest fixed learning rate, but then quickly reduces the learning rate which dampens the oscillations and permits a continual reduction in loss, beyond what any fixed learning rate could achieve. The best cooling schedule ($\eta=1/t$) outperforms vSGD, but when the schedule is not well tuned ($\eta=0.2/t$), the effect on the loss is catastrophic, even though the produced learning rates are very close to the oracle's (see the overlapping green crosses and the dashed black line at the bottom).
  • Figure 4: Non-stationary loss. The loss is quadratic but now the target value ($\mu$) changes abruptly every 300 time-steps. Above: loss as a function of time; below: corresponding learning rates. This illustrates the limitations of SGD with fixed or decaying learning rates (full lines): any fixed learning rate limits the precision to which the optimum can be approximated (progress stalls), while any cooling schedule cannot cope with the non-stationarity. In contrast, our adaptive setting ('vSGD', red circles) closely resembles the optimal behavior (oracle, black dashes). The learning rate decays like $1/t$ during the static part, but increases again after each abrupt change (with just a very small delay compared to the oracle). The average loss across time is substantially better than for any SGD cooling schedule.
  • Figure 5: Training error versus test error on the three MNIST setups (after 6 epochs). Different symbol-color combinations correspond to different algorithms, with the best-tuned parameter setting shown as a much larger symbol than the other settings tried (the performance of Almeida is so bad that it is off the charts). The axes are zoomed to the regions of interest for clarity; for a more global perspective, and for the corresponding plots on the CIFAR benchmarks, see Figures \ref{fig:tvt-cifar} and \ref{fig:tvt-glob}. Note that there was no tuning for our parameter-free vSGD, yet its performance is consistently good (see black circles).
  • ...and 6 more figures
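
The schedule comparison described in the Figure 3 caption (noisy quadratic, $d=1$, $h=1$) can be reproduced in a few lines. This is a minimal sketch under stated assumptions: the helper `run_sgd`, the starting point `theta0=5.0`, the unit noise scale, and the sample count are illustrative choices, not values taken from the paper.

```python
import random


def run_sgd(schedule, samples, theta0=5.0, h=1.0):
    """Run SGD on per-sample losses 0.5 * h * (theta - x)**2.

    `schedule(t)` returns the learning rate for step t (1-indexed).
    Returns the final theta.
    """
    theta = theta0
    for t, x in enumerate(samples, start=1):
        grad = h * (theta - x)
        theta -= schedule(t) * grad
    return theta


rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # sample minima around theta* = 0

theta_fixed = run_sgd(lambda t: 1.0, samples)      # fixed rate eta = 1.0
theta_cool = run_sgd(lambda t: 1.0 / t, samples)   # cooling schedule eta = 1/t
theta_slow = run_sgd(lambda t: 0.2 / t, samples)   # mistuned schedule eta = 0.2/t
```

With $h=1$, the fixed rate $\eta=1$ makes each step jump to the latest sample, so the iterate keeps oscillating at the noise level. The schedule $\eta=1/t$ turns the iterate into the running mean of the samples seen so far. With $\eta=0.2/t$ the influence of the starting point shrinks only like $t^{-0.2}$, which is why a schedule that looks similar can still leave the iterate far from the optimum.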