AutoSGD: Automatic Learning Rate Selection for Stochastic Gradient Descent

Nikola Surjanovic, Alexandre Bouchard-Côté, Trevor Campbell

TL;DR

AutoSGD addresses the tuning burden of learning-rate schedules in SGD by introducing an adaptive, parameter-free scheme that operates in episodes and selects among three neighboring rates per step using forward-backward comparisons. The deterministic variant AutoGD demonstrates stable convergence and natural warmup/decay of the learning rate, with a recommended default grid $(c=1/2, C=2)$ and a formal convergence guarantee under standard smoothness and PL conditions. The stochastic version AutoSGD extends these ideas to noisy gradients via a constant-memory online decision process that uses independent noise streams to compare performance across rate options, yielding linear convergence in episode iterations. Empirical results across classical optimization tasks and ML training tasks show AutoSGD is robust to initialization and competitive with DoG and line-search baselines while requiring little to no tuning. This work contributes a general, memory-efficient framework for adaptive, parameter-free learning-rate selection in stochastic optimization and opens avenues for further exploration of decision processes and grid design.
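To make the "three neighboring rates" idea concrete, here is a minimal deterministic sketch in Python. It is not the paper's AutoGD rule: the forward-backward comparison is not spelled out in this summary, so the sketch assumes a plain comparison of objective values at the three candidate points and shrinks the rate when none of them improves. The function name `three_rate_gd_sketch` and the toy quadratic are hypothetical and chosen only for illustration.

```python
def three_rate_gd_sketch(f, grad, x0, gamma0, c=0.5, C=2.0, steps=50):
    """Toy three-rate selection (assumed rule, not the paper's exact AutoGD)."""
    x, gamma = list(x0), gamma0
    history = [(f(x), gamma)]  # (objective value, current rate) per step
    for _ in range(steps):
        g = grad(x)
        fx = f(x)
        trials = []
        # Candidates: the current rate and its two neighbours on the grid.
        for rate in (c * gamma, gamma, C * gamma):
            x_new = [xi - rate * gi for xi, gi in zip(x, g)]
            trials.append((f(x_new), rate, x_new))
        best_f, best_rate, best_x = min(trials, key=lambda t: t[0])
        if best_f < fx:
            # Accept the best candidate; its rate becomes the new centre.
            x, gamma = best_x, best_rate
        else:
            # Assumption: if no candidate decreases f, stay put and shrink.
            gamma = c * gamma
        history.append((f(x), gamma))
    return x, history


def quadratic(x):
    """Hypothetical test objective: x1^2 + 10 * x2^2."""
    return x[0] ** 2 + 10.0 * x[1] ** 2


def quadratic_grad(x):
    return [2.0 * x[0], 20.0 * x[1]]


if __name__ == "__main__":
    for gamma0 in (1e-4, 10.0):
        _, hist = three_rate_gd_sketch(quadratic, quadratic_grad, [1.0, 1.0], gamma0)
        print(gamma0, hist[-1])
```

Under this assumed rule, a rate that starts too small grows by a factor of $C$ per step while the largest candidate keeps winning, and a rate that starts too large is cut by $c$ until some candidate decreases the objective. The accepted objective values are non-increasing by construction.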

Abstract

The learning rate is an important tuning parameter for stochastic gradient descent (SGD) and can greatly influence its performance. However, appropriate selection of a learning rate schedule across all iterations typically requires a non-trivial amount of user tuning effort. To address this, we introduce AutoSGD: an SGD method that automatically determines whether to increase or decrease the learning rate at a given iteration and then takes appropriate action. We introduce theory supporting the convergence of AutoSGD, along with its deterministic counterpart for standard gradient descent. Empirical results suggest strong performance of the method on a variety of traditional optimization problems and machine learning tasks.

Paper Structure

This paper contains 8 sections, 3 theorems, 12 equations, 3 figures, 1 algorithm.

Key Result

Proposition 4.1

Let $f$ be differentiable. The objective values $f(x_t)$ along the AutoGD iterates converge (though not necessarily to a local minimum): $\lim_{t \to \infty} f(x_t) = f_\infty$, for some $f_\infty \geq 0$.

Figures (3)

  • Figure 1: Performance of deterministic optimizers on the non-convex objective function $f(x, y) = 1 - 1/(1 + x^2 + 4y^2)$. Left: Surface plot of the objective function. Middle: Trajectories of AutoGD with initial learning rates $\gamma_0 \in \{0.001, 10.0\}$ and GD with learning rates $\gamma \in \{0.5, 10.0\}$ over 100 iterations. Here, GD with $\gamma = 0.5$ converges very slowly, while $\gamma = 10.0$ is unstable. AutoGD is stable as it approaches the minimum for different learning rate values. Top right: Automatically selected learning rates (on log scale) for each of the first 60 iterations. AutoGD automatically learns to anneal the learning rate in the initial phase, and then decreases the learning rate upon convergence. Bottom right: Distance to optimum (log scale) for AutoGD and GD iterates.
  • Figure 2: Example of the AutoSGD learning rate selection procedure with $C = 1/c = 5$ (a larger value of $C$ is used for better visualization). Episode endpoints are indicated as black vertical lines and the middle learning rate within the episode is indicated at the top of each section. The exact objective function value is known at episode endpoints here, but is typically estimated in practice. Bold lines indicate winning trajectories. Episode 0: The highest learning rate is selected for the next episode. Episode 1: The smallest learning rate is selected. Episode 2: The middle learning rate is selected. Episode 3: Evidence of function increase at all learning rates, so the learning rate is scaled by $1/C^2$ and the procedure restarts at the previous episode's starting point $x_{t-1}$. Episode 4: In progress.
  • Figure 3: Performance of the new learning rate selection procedure within AutoSGD (plotted on log-log scale) on the "sum of quadratics" problem. AutoSGD is initialized with four different learning rates $\gamma \in \{10^{-1}, 10^{-2}, 10^{-3}, 10^{-5}\}$. All initializations learn to automatically warm up the learning rate, and the learning rate eventually decays at a rate of approximately $O(1/t)$ as the iterates converge in this example.
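The episode logic described in the Figure 2 caption can be sketched roughly as follows, using the Figure 1 objective as a test function. Several pieces are assumptions made for illustration rather than details taken from the paper: episodes have a fixed length, the exact objective is evaluated at episode endpoints (as in the Figure 2 setup), gradient noise is additive Gaussian, and a failed episode restarts from its own starting point, which is one reading of the caption. All function names are hypothetical.

```python
import random


def f(p):
    """Figure 1 objective: f(x, y) = 1 - 1 / (1 + x^2 + 4 y^2)."""
    x, y = p
    return 1.0 - 1.0 / (1.0 + x * x + 4.0 * y * y)


def noisy_grad(p, rng, sigma):
    """Exact gradient of f plus Gaussian noise (assumed noise model)."""
    x, y = p
    s = 1.0 + x * x + 4.0 * y * y
    return (2.0 * x / (s * s) + rng.gauss(0.0, sigma),
            8.0 * y / (s * s) + rng.gauss(0.0, sigma))


def run_stream(p, rate, length, rng, sigma):
    """Run `length` SGD steps from p at a fixed rate with this stream's noise."""
    for _ in range(length):
        g = noisy_grad(p, rng, sigma)
        p = (p[0] - rate * g[0], p[1] - rate * g[1])
    return p


def episode_sketch(p0, gamma0, C=2.0, episodes=30, length=20, sigma=0.01, seed=0):
    """Toy episode-based rate selection (assumed rule, not the paper's AutoSGD)."""
    # One independent noise stream per candidate rate.
    rngs = [random.Random(seed + k) for k in range(3)]
    p, gamma = p0, gamma0
    trace = [(f(p), gamma)]
    for _ in range(episodes):
        rates = [gamma / C, gamma, gamma * C]
        ends = [run_stream(p, r, length, rng, sigma) for r, rng in zip(rates, rngs)]
        values = [f(e) for e in ends]
        best = min(range(3), key=values.__getitem__)
        if values[best] < f(p):
            # The winning trajectory's endpoint and rate start the next episode.
            p, gamma = ends[best], rates[best]
        else:
            # Every rate increased f: scale the rate by 1/C^2 and restart.
            gamma = gamma / (C * C)
        trace.append((f(p), gamma))
    return p, trace


if __name__ == "__main__":
    for gamma0 in (1e-3, 10.0):
        _, trace = episode_sketch((2.0, 1.0), gamma0)
        print(gamma0, trace[-1])
```

Because an episode is accepted only when its best endpoint improves on the starting value, the recorded objective values never increase in this sketch. In practice the endpoint values would be estimated rather than computed exactly, as the Figure 2 caption notes.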

Theorems & Definitions (3)

  • Proposition 4.1
  • Proposition 4.2
  • Theorem 4.5