Nesterov acceleration despite very noisy gradients

Kanan Gupta; Jonathan W. Siegel; Stephan Wojtowytsch

TL;DR

A generalization of Nesterov's accelerated gradient descent algorithm that provably achieves acceleration for smooth convex and strongly convex minimization tasks with noisy gradient estimates if the noise intensity is proportional to the magnitude of the gradient at every point.

Abstract

We present a generalization of Nesterov's accelerated gradient descent algorithm. Our algorithm (AGNES) provably achieves acceleration for smooth convex and strongly convex minimization tasks with noisy gradient estimates if the noise intensity is proportional to the magnitude of the gradient at every point. Nesterov's method converges at an accelerated rate if the constant of proportionality is below 1, while AGNES accommodates any signal-to-noise ratio. The noise model is motivated by applications in overparametrized machine learning. AGNES requires only two parameters in convex and three in strongly convex minimization tasks, improving on existing methods. We further provide clear geometric interpretations and heuristics for the choice of parameters.
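The multiplicative noise model described above can be illustrated with a small sketch. This is not the AGNES algorithm itself: it runs plain stochastic gradient descent on a toy quadratic, and all names (`noisy_gradient`, `run_sgd`), the choice of noise supported on a sphere, the quadratic objective, and the step size $1/(L(1+\sigma^2))$ are assumptions made for illustration rather than details taken from the paper. The gradient estimate $g = \nabla f(x) + \sigma\|\nabla f(x)\|\,\xi$, with $\xi$ uniform on the unit sphere, is unbiased and satisfies $\|g - \nabla f(x)\|^2 = \sigma^2\|\nabla f(x)\|^2$, so the noise vanishes exactly where the gradient does.

```python
import numpy as np


def noisy_gradient(grad, sigma, rng):
    """Hypothetical oracle: grad + sigma * ||grad|| * xi, xi uniform on the unit sphere.

    The estimate is unbiased and its squared error equals
    sigma^2 * ||grad||^2, i.e. the noise scales with the gradient.
    """
    xi = rng.standard_normal(grad.shape)
    xi /= np.linalg.norm(xi)  # project the Gaussian sample onto the unit sphere
    return grad + sigma * np.linalg.norm(grad) * xi


def run_sgd(sigma, steps, seed=0, dim=10):
    """Plain SGD on the toy quadratic f(x) = 0.5 * sum(lam_i * x_i^2).

    Assumption for this sketch: step size eta = 1 / (L * (1 + sigma^2)),
    where L is the largest eigenvalue (the smoothness constant).
    """
    rng = np.random.default_rng(seed)
    lam = np.linspace(0.1, 1.0, dim)  # eigenvalues of the Hessian
    L = lam.max()
    eta = 1.0 / (L * (1.0 + sigma**2))
    x = np.ones(dim)
    losses = [0.5 * np.sum(lam * x**2)]
    for _ in range(steps):
        g = noisy_gradient(lam * x, sigma, rng)  # exact gradient is lam * x
        x = x - eta * g
        losses.append(0.5 * np.sum(lam * x**2))
    return losses


if __name__ == "__main__":
    losses = run_sgd(sigma=5.0, steps=2000)
    print(f"initial loss {losses[0]:.3e}, final loss {losses[-1]:.3e}")
```

Even with $\sigma = 5$, where the noise is five times larger than the signal, the loss decreases in this toy setting because the step size shrinks with $1+\sigma^2$. The cost is slower progress per step, which is the regime in which the abstract's claim about acceleration is of interest.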

Paper Structure (38 sections, 27 theorems, 191 equations, 8 figures, 1 algorithm)

Introduction
Literature Review
Accelerated first order methods.
Stochastic optimization.
Acceleration with stochastic gradients.
Algorithm and Convergence Guarantees
Assumptions
Nesterov's Method with Multiplicative Noise
AGNES Descent algorithm
Geometric Interpretation
Motivation for Multiplicative Noise
Numerical Experiments
Convex optimization
Neural network regression
Image classification
...and 23 more sections

Key Result

Theorem 1

Suppose that $x_n$ and $x'_n$ are generated by the time-stepping scheme of Nesterov's method, $f$ and $g$ satisfy the conditions laid out in the Assumptions section, $f$ is convex, and $x^*$ is a point such that $f(x^*) = \inf_{x\in\mathbb{R}^m} f(x)$. If $\sigma < 1$ and the parameters are chosen such ... The expectation on the right-hand side is over the random initialization $x_0$.

Figures (8)

Figure 1: The minimal $n$ for AGNES and SGD such that $\mathbb E[f(x_n) - \inf f] < \varepsilon$ when minimizing an $L$-smooth function with multiplicative noise intensity $\sigma$ in the gradient estimates and under a convexity assumption. The SGD rate of the $\mu$-strongly convex case is achieved more generally under a PL condition with PL-constant $\mu$. While SGD requires the optimal choice of one variable to achieve the optimal rate, AGNES requires three (two in the deterministic case).
Figure 2: To be able to quantify the gradient noise exactly, we choose relatively small models and data sets. Left: A ReLU network with four hidden layers of width 250 is trained by SGD to fit random labels $y_i$ (drawn from a 2-dimensional standard Gaussian) at $1,000$ random data points $x_i$ (drawn from a 500-dimensional standard Gaussian). The variance $\sigma^2$ of the gradient estimators is $\sim 10^5$ times larger than the loss function and $\sim 10^6$ times larger than the parameter gradient. This relationship is stable over approximately ten orders of magnitude. Right: A ReLU network with two hidden layers of width 50 is trained by SGD to fit the Runge function $1/(1+x^2)$ on equispaced data samples in the interval $[-8,8]$. Also here, the variance in the gradient estimates is proportional to both the loss function and the magnitude of the gradient.
Figure 3: We plot $\mathbb E[f_{d}(x_n)]$ on a loglog scale for SGD (blue), AGNES (red), NAG (green), ACDM (orange) and CNM (maroon) with $d=4$ (left) and $d=16$ (right) for noise levels $\sigma=0$ (solid line), $\sigma=10$ (dashed) and $\sigma =50$ (dotted). The initial condition is $x_0=1$ in all simulations. Means are computed over 200 runs. After an initial plateau, AGNES, CNM and ACDM significantly outperform SGD in all settings, while NAG (green) diverges if $\sigma$ is large. The length of the initial plateau increases with $\sigma$.
Figure 4: We report the training loss as a running average with decay rate 0.99 (top row) and test loss (bottom row) for batch sizes 100 (left column), 50 (middle column), and 10 (right column) in the setting of the section "Neural network regression". The horizontal axis represents the number of optimizer steps. The performance gap between AGNES and other algorithms widens for smaller batch sizes, where the gradient estimates are more stochastic and the two different parameters $\alpha, \eta$ add the most benefit.
Figure 5: We report the training loss as a running average with decay rate 0.99 (top row) and test accuracy (bottom row) for ResNet-34 trained on CIFAR-10 with batch sizes 50 (left column) and 10 (middle column), and ResNet-50 trained with batch size 50 (right column). The performance of AGNES with the proposed hyperparameters is stable over the changes in model and batch size.
...and 3 more figures

Theorems & Definitions (50)

Theorem 1: NAG, convex case
Theorem 2: NAG, strongly convex case
Theorem 3: AGNES, convex case
Theorem 4: AGNES, strongly convex case
Corollary 4
Remark 5: Batching
Theorem 6: Convexity without minimizers
Lemma 6: Noise intensity
Lemma 7
proof
...and 40 more
