Lookahead Optimizer: k steps forward, 1 step back

Michael R. Zhang, James Lucas, Geoffrey Hinton, Jimmy Ba

TL;DR

The paper identifies inefficiencies in conventional SGD-based training, notably hyperparameter sensitivity and variance in updates. It introduces Lookahead, a two-weight optimization scheme that alternates inner fast updates with an outer slow-weight interpolation towards the fast weights, yielding variance reduction and improved stability. Theoretical analysis on noisy and deterministic quadratic models supports faster convergence and reduced steady-state variance, while extensive experiments across CIFAR, ImageNet, language modeling, and machine translation demonstrate faster convergence and robust performance with modest overhead. Overall, Lookahead provides a practical, broadly compatible enhancement to standard optimizers, reducing hyperparameter sensitivity and improving training efficiency in deep learning.

Abstract

The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam, and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of fast weights generated by another optimizer. We show that Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost. We empirically demonstrate Lookahead can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings on ImageNet, CIFAR-10/100, neural machine translation, and Penn Treebank.
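The two-weight scheme described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes plain gradient descent as the inner optimizer, and the function name `lookahead_sgd`, the toy quadratic, and the values `k=5`, `alpha=0.5`, `lr=0.1` are all illustrative choices rather than settings taken from the text.

```python
def lookahead_sgd(grad, phi0, k=5, alpha=0.5, lr=0.1, outer_steps=100):
    """Lookahead wrapped around plain gradient descent (illustrative sketch)."""
    slow = list(phi0)
    for _ in range(outer_steps):
        # Fast weights start each outer iteration from the current slow weights.
        fast = list(slow)
        # "k steps forward": run the inner optimizer k times on the fast weights.
        for _ in range(k):
            g = grad(fast)
            fast = [w - lr * gi for w, gi in zip(fast, g)]
        # "1 step back": move the slow weights a fraction alpha toward the fast weights.
        slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]
    return slow


def quad_grad(x):
    # Gradient of the toy loss 0.5 * (1.0 * x[0]**2 + 10.0 * x[1]**2), minimized at the origin.
    return [1.0 * x[0], 10.0 * x[1]]


result = lookahead_sgd(quad_grad, [1.0, 1.0])
```

Only one extra copy of the weights (`slow`) is kept, and the interpolation runs once every `k` inner steps, which is consistent with the small overhead the abstract describes.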

Paper Structure

This paper contains 38 sections, 3 theorems, 21 equations, 17 figures, 6 tables, 1 algorithm.

Key Result

Proposition 1

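For a quadratic loss function $L(x) = \frac{1}{2} x^T A x - b^T x$, the step size $\alpha^*$ that minimizes the loss along the line through the two points $\theta_{t,0}$ and $\theta_{t,k}$ is given by: $$\alpha^* = \arg\min_{\alpha} L\big(\theta_{t,0} + \alpha(\theta_{t,k} - \theta_{t,0})\big) = \frac{(\theta_{t,0} - \theta^*)^T A (\theta_{t,0} - \theta_{t,k})}{(\theta_{t,0} - \theta_{t,k})^T A (\theta_{t,0} - \theta_{t,k})}$$ where $\theta^* = A^{-1}b$ minimizes the loss.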
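A small numerical check can make the proposition concrete. The sketch below assumes a symmetric positive definite $A$ and uses the closed form obtained by setting the derivative of $L(\theta_{t,0} + \alpha(\theta_{t,k} - \theta_{t,0}))$ with respect to $\alpha$ to zero (written out in the code comment). The matrix, the points, and all helper names are hypothetical values chosen for illustration.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def loss(A, b, x):
    # Quadratic loss L(x) = 0.5 * x^T A x - b^T x.
    return 0.5 * dot(x, matvec(A, x)) - dot(b, x)


def optimal_alpha(A, theta_star, theta_0, theta_k):
    # alpha* = (theta_0 - theta*)^T A (theta_0 - theta_k)
    #          / (theta_0 - theta_k)^T A (theta_0 - theta_k), assuming symmetric A.
    d = [p - q for p, q in zip(theta_0, theta_k)]
    e = [p - q for p, q in zip(theta_0, theta_star)]
    Ad = matvec(A, d)
    return dot(e, Ad) / dot(d, Ad)


# Illustrative problem: choose theta* first, then set b = A theta* so that theta* = A^{-1} b.
A = [[2.0, 0.5], [0.5, 1.0]]
theta_star = [1.0, -1.0]
b = matvec(A, theta_star)
theta_0 = [3.0, 2.0]
theta_k = [2.0, 1.5]

a_star = optimal_alpha(A, theta_star, theta_0, theta_k)


def line_loss(alpha):
    # Loss at the interpolated point theta_0 + alpha * (theta_k - theta_0).
    x = [p + alpha * (q - p) for p, q in zip(theta_0, theta_k)]
    return loss(A, b, x)
```

Evaluating `line_loss` slightly to either side of `a_star` should give a larger value than `line_loss(a_star)`, since the loss restricted to the line is a convex parabola in `alpha`.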

Figures (17)

  • Figure 1: Lookahead Optimizer:
  • Figure 2: CIFAR-10 training loss with fixed and adaptive $\alpha$. The adaptive $\alpha$ is clipped between $[\alpha_{low}, 1]$. (Left) Adam learning rate = 0.001. (Right) Adam learning rate = 0.003.
  • Figure 3: Comparing expected optimization progress between SGD and Lookahead($k=5$) on the noisy quadratic model. Each vertical slice compares the convergence of optimizers with the same final loss values. For Lookahead, convergence rates for 100 evenly spaced $\alpha$ values in the range $(0,1]$ are overlaid.
  • Figure 4: Quadratic convergence rates ($1-\rho$) of classical momentum versus Lookahead wrapping classical momentum. For Lookahead, we fix $k=20$ lookahead steps and $\alpha=0.5$ for the slow weights step size. Lookahead is able to significantly improve on the convergence rate in the under-damped regime where oscillations are observed.
  • Figure 5: CIFAR Final Validation Accuracy.
  • ...and 12 more figures

Theorems & Definitions (6)

  • Proposition 1: Optimal slow weights step size
  • Proposition 2: Lookahead steady-state risk
  • Lemma 1
  • proof
  • proof
  • proof