AutoGD: Automatic Learning Rate Selection for Gradient Descent

Nikola Surjanovic, Alexandre Bouchard-Côté, Trevor Campbell

TL;DR

AutoGD addresses learning-rate tuning in gradient-based optimization by automatically selecting the step size at each iteration from a small set around a baseline, including a no-movement option, and enforcing descent with an Armijo condition. The core mechanism is the update $x_{t+1}=x_t-\gamma_t\nabla f(x_t)$ together with the selection set $\{0, c^{-1}\gamma_t, \gamma_t, c\gamma_t\}$. The approach yields both asymptotic and nonasymptotic guarantees: it converges to a local minimum for $L$-smooth objectives, including locally $\mu$-strongly convex ones, without knowledge of $L$ or $\mu$, and attains the GD rate up to constants under mild unimodality assumptions. The method is robust across classical optimization problems and variational-inference tasks, often outperforming backtracking line search and standard GD, and its extensions to AutoBFGS and AutoLBFGS show substantial practical gains. These results suggest a broadly useful, tuning-free optimization primitive suitable for inner-loop use in larger algorithms and stochastic settings, with future work possible on richer learning-rate grids and second-order variants.

Abstract

The performance of gradient-based optimization methods, such as standard gradient descent (GD), greatly depends on the choice of learning rate. However, it can require a non-trivial amount of user tuning effort to select an appropriate learning rate schedule. When such methods appear as inner loops of other algorithms, expecting the user to tune the learning rates may be impractical. To address this, we introduce AutoGD: a gradient descent method that automatically determines whether to increase or decrease the learning rate at a given iteration. We establish the convergence of AutoGD, and show that we can recover the optimal rate of GD (up to a constant) for a broad class of functions without knowledge of smoothness constants. Experiments on a variety of traditional problems and variational inference optimization tasks demonstrate strong performance of the method, along with its extensions to AutoBFGS and AutoLBFGS.
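The selection rule summarized in the TL;DR can be illustrated with a minimal Python sketch. This is one plausible reading of the mechanism, not the authors' reference implementation: the function name `autogd_sketch`, the Armijo constant `eta`, the default `c = 2`, the choice of the candidate with the lowest objective value, the rule that shrinks the baseline by `c` after a no-movement step, and the starting point `[2.0, 1.0]` are all assumptions made for illustration. The test objective is the one given in the Figure 1 caption.

```python
def autogd_sketch(f, grad, x0, gamma0=1.0, c=2.0, eta=1e-4, iters=100):
    """Illustrative sketch only; names and update rules are assumptions."""
    x = list(x0)
    gamma = gamma0              # baseline learning rate
    fx = f(x)
    history = [fx]
    for _ in range(iters):
        g = grad(x)
        g_sq = sum(gi * gi for gi in g)
        # Start from the no-movement option (learning rate 0).
        best_lr, best_x, best_val = 0.0, x, fx
        for lr in (gamma / c, gamma, gamma * c):
            cand = [xi - lr * gi for xi, gi in zip(x, g)]
            val = f(cand)
            # Armijo sufficient-decrease check; keep the lowest passing value.
            if val <= fx - eta * lr * g_sq and val < best_val:
                best_lr, best_x, best_val = lr, cand, val
        if best_lr > 0.0:
            x, fx, gamma = best_x, best_val, best_lr
        else:
            gamma = gamma / c   # assumed rule: shrink baseline if no step passes
        history.append(fx)
    return x, gamma, history


def fig1_objective(p):
    # f(x, y) = 1 - 1/(1 + x^2 + 4y^2), from the Figure 1 caption.
    x, y = p
    return 1.0 - 1.0 / (1.0 + x * x + 4.0 * y * y)


def fig1_gradient(p):
    x, y = p
    d = (1.0 + x * x + 4.0 * y * y) ** 2
    return [2.0 * x / d, 8.0 * y / d]


if __name__ == "__main__":
    # Initial learning rates taken from the Figure 1 caption.
    for gamma0 in (0.001, 10.0):
        x_final, gamma_final, hist = autogd_sketch(
            fig1_objective, fig1_gradient, [2.0, 1.0], gamma0=gamma0
        )
        print(gamma0, hist[0], hist[-1], gamma_final)
```

Because the no-movement option is always available and every accepted step passes the Armijo check, the recorded objective values in this sketch never increase, whatever initial learning rate is supplied.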

Paper Structure

This paper contains 26 sections, 13 theorems, 96 equations, 7 figures, 3 algorithms.

Key Result

Proposition 4.1

There exists a $\slf \geq 0$ such that the iterates $x_t$ of AutoGD satisfy $f(x_t) \downarrow \slf$.
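A short sketch of why a monotone limit of this kind is plausible, assuming the standard Armijo sufficient-decrease form with a constant $\eta \in (0, 1)$ and a nonnegative objective (the symbol $\eta$ and this exact form are assumptions, not quoted from the paper):

$$
f\bigl(x_t - \gamma \nabla f(x_t)\bigr) \;\le\; f(x_t) - \eta\,\gamma\,\|\nabla f(x_t)\|^2 .
$$

Any accepted step with $\gamma > 0$ satisfies this inequality and so cannot increase $f$, while the no-movement option $\gamma = 0$ leaves $f$ unchanged. Under these assumptions the sequence $f(x_t)$ is nonincreasing and bounded below by $0$, so it converges to a nonnegative limit by the monotone convergence theorem.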

Figures (7)

  • Figure 1: Performance of deterministic optimizers on the non-convex objective function $f(x, y) = 1 - 1/(1 + x^2 + 4y^2)$. Left: Surface plot of the objective function. Middle: Trajectories of AutoGD with initial learning rates $\gamma_0 \in \{0.001, 10.0\}$ and GD with learning rates $\gamma \in \{0.5, 10.0\}$ over 100 iterations. Here, GD with $\gamma = 0.5$ converges very slowly, while $\gamma = 10.0$ is unstable. AutoGD is stable as it approaches the minimum for different initial learning rate values. Top right: Automatically selected learning rates (on log scale) for each of the first 60 iterations. AutoGD automatically learns to anneal the learning rate in the initial phase, and then decreases the learning rate upon convergence. Bottom right: Distance to optimum (log scale) for AutoGD and GD iterates.
  • Figure 2: Two counterexamples demonstrating the importance of the Armijo condition and diffuse initialization for AutoGD. Left: First counterexample. Orange dashed arrows indicate AutoGD without the Armijo condition converging to a cycle. In contrast, AutoGD with the Armijo condition (green) converges to a local minimum. Right: Second counterexample. Orange dashed arrows indicate AutoGD with a deterministic starting point converging to a local maximum. By using a diffuse initialization (green), AutoGD is able to avoid the local maximum almost surely and converge to a local minimum.
  • Figure 3: Percentage of runs for a given (learning rate, optimizer) combination that reach within a 1.1x level of tolerance to the best objective function value on the classical optimization test set. Left: First-order methods. Right: Second-order methods.
  • Figure 4: Percentage of runs for a given (learning rate, optimizer) combination that reach within a 1.1x level of tolerance to the best objective function value (higher is better) for various variational inference problems.
  • Figure 5: Performance of AutoGD and AdGD2 on various difficult objective functions. Objective function values (on log scale) are presented in the top row and the corresponding selected learning rates are in the bottom row. Left: Function with fat tails, $f(x) = \log(\log(1+x^2)+1)$. Optimizers are initialized at $x = 1000$. Middle: Function with rapidly changing second derivatives, $f(x) = x^2 + 0.9(1-\cos(x^2))$. Optimizers are initialized at $x = 1000$. Right: Function with rapid growth in the tails, $f(x)=x^{20}$. Optimizers are initialized at $x = 100$.
  • ...and 2 more figures
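To give a sense of scale for the three objectives in the Figure 5 caption, the following Python sketch writes out each function with a hand-derived derivative and checks it against a central finite difference. The function names, the helper `central_difference`, and the check point are hypothetical choices for illustration; only the formulas and the initial points come from the caption.

```python
import math


def fat_tails(x):
    # f(x) = log(log(1 + x^2) + 1)
    return math.log(math.log(1.0 + x * x) + 1.0)


def fat_tails_grad(x):
    return (2.0 * x / (1.0 + x * x)) / (math.log(1.0 + x * x) + 1.0)


def oscillating(x):
    # f(x) = x^2 + 0.9 (1 - cos(x^2))
    return x * x + 0.9 * (1.0 - math.cos(x * x))


def oscillating_grad(x):
    return 2.0 * x * (1.0 + 0.9 * math.sin(x * x))


def steep_tails(x):
    # f(x) = x^20
    return x ** 20


def steep_tails_grad(x):
    return 20.0 * x ** 19


def central_difference(f, x, h=1e-6):
    # Numerical derivative used only to sanity-check the formulas above.
    return (f(x + h) - f(x - h)) / (2.0 * h)


# (objective, derivative, initial point from the Figure 5 caption)
CASES = [
    (fat_tails, fat_tails_grad, 1000.0),
    (oscillating, oscillating_grad, 1000.0),
    (steep_tails, steep_tails_grad, 100.0),
]

if __name__ == "__main__":
    for f, g, x_start in CASES:
        print(f.__name__, "derivative at start:", g(x_start))
```

At the stated initial points the derivative of the fat-tailed function is on the order of $10^{-4}$, while that of $x^{20}$ is $2 \times 10^{39}$, so no single fixed learning rate is reasonable for both.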

Theorems & Definitions (23)

  • Proposition 4.1
  • Theorem 4.2
  • Definition 4.3
  • Definition 4.4
  • Theorem 4.5
  • Theorem 4.9
  • Theorem 4.10
  • Theorem 4.12
  • Definition A.1
  • Definition A.2
  • ...and 13 more