AutoSGD: Automatic Learning Rate Selection for Stochastic Gradient Descent
Nikola Surjanovic, Alexandre Bouchard-Côté, Trevor Campbell
TL;DR
AutoSGD reduces the burden of tuning learning-rate schedules for SGD with an adaptive, parameter-free scheme that operates in episodes and, at each step, selects among three neighboring learning rates using forward-backward comparisons. The deterministic variant, AutoGD, demonstrates stable convergence and natural warmup and decay of the learning rate, with a recommended default grid $(c=1/2, C=2)$ and a formal convergence guarantee under standard smoothness and Polyak-Łojasiewicz (PL) conditions. The stochastic version, AutoSGD, extends these ideas to noisy gradients via a constant-memory online decision process that uses independent noise streams to compare performance across rate options, yielding linear convergence in episode iterations. Empirical results on classical optimization tasks and ML training tasks show that AutoSGD is robust to initialization and competitive with DoG and line-search baselines while requiring little to no tuning. This work contributes a general, memory-efficient framework for adaptive, parameter-free learning-rate selection in stochastic optimization and suggests avenues for further exploration of decision processes and grid design.
Abstract
The learning rate is an important tuning parameter for stochastic gradient descent (SGD) and can greatly influence its performance. However, appropriate selection of a learning rate schedule across all iterations typically requires a non-trivial amount of user tuning effort. To address this, we introduce AutoSGD: an SGD method that automatically determines whether to increase or decrease the learning rate at a given iteration and then takes appropriate action. We introduce theory supporting the convergence of AutoSGD, along with its deterministic counterpart for standard gradient descent. Empirical results suggest strong performance of the method on a variety of traditional optimization problems and machine learning tasks.
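To make the three-rate selection idea concrete, the following is a minimal deterministic sketch, not the authors' AutoGD or AutoSGD algorithm. It assumes the grid $(c, C)$ acts as multipliers on the current learning rate $\gamma$, so the candidates are $c\gamma$, $\gamma$, and $C\gamma$, and it simply keeps whichever candidate gives the lowest objective value. The function name `three_rate_gd`, the greedy selection rule, and the fallback of shrinking the rate when no candidate improves are all illustrative assumptions.

```python
import numpy as np


def three_rate_gd(f, grad, x0, lr0=1.0, c=0.5, C=2.0, n_iters=200):
    """Greedy three-candidate gradient descent (illustrative sketch only).

    Each iteration tries the step sizes c*lr, lr, and C*lr, and keeps the
    one with the lowest objective value. This is a simplification and is
    not the decision rule from the paper.
    """
    x = np.asarray(x0, dtype=float)
    lr = float(lr0)
    lr_history = []
    for _ in range(n_iters):
        g = grad(x)
        fx = f(x)
        candidates = [c * lr, lr, C * lr]
        trial_points = [x - step * g for step in candidates]
        trial_values = [f(p) for p in trial_points]
        best = int(np.argmin(trial_values))
        if trial_values[best] < fx:
            # Accept the best candidate and adopt its learning rate.
            x = trial_points[best]
            lr = candidates[best]
        else:
            # Assumed fallback: no candidate improved, so shrink the rate.
            lr = c * lr
        lr_history.append(lr)
    return x, lr_history


# Usage on an ill-conditioned quadratic with a deliberately poor initial rate.
A = np.diag([1.0, 10.0])
quad = lambda x: 0.5 * x @ A @ x
quad_grad = lambda x: A @ x
x_final, lrs = three_rate_gd(quad, quad_grad, x0=[1.0, 1.0], lr0=100.0)
```

In this sketch the learning rate contracts when the initial value is too large and grows when it is too small, which loosely mirrors the warmup and decay behavior described in the TL;DR. It needs three extra objective evaluations per iteration and uses exact gradients, so it does not reflect the constant-memory, noise-aware decision process of AutoSGD.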
