Gradient correlation is a key ingredient to accelerate SGD with momentum

Julien Hermant, Marien Renaud, Jean-François Aujol, Charles Dossal, Aude Rondepierre

TL;DR

The paper tackles whether stochastic momentum (SNAG) can accelerate SGD in convex finite-sum problems under interpolation. It introduces RACOGA, a gradient-correlation condition, and shows how it governs the Strong Growth Condition constant ρ_K, thereby determining when SNAG can beat SGD. The work provides both finite-time and almost-sure convergence results under RACOGA and its relaxations, plus a theory for how batch size interacts with gradient correlation. Empirical results on linear regression and neural networks corroborate that higher gradient correlation along the optimization path yields faster SNAG performance, validating the proposed framework and its practical implications for choosing batch sizes and hyperparameters.

Abstract

Empirically, it has been observed that adding momentum to Stochastic Gradient Descent (SGD) accelerates the convergence of the algorithm. However, the literature has been rather pessimistic, even in the case of convex functions, about the possibility of theoretically proving this observation. We investigate the possibility of obtaining accelerated convergence of the Stochastic Nesterov Accelerated Gradient (SNAG), a momentum-based version of SGD, when minimizing a sum of functions in a convex setting. We demonstrate that the average correlation between gradients makes it possible to verify the strong growth condition, which is the key ingredient for obtaining acceleration with SNAG. Numerical experiments, in both linear regression and deep neural network optimization, confirm our theoretical results in practice.
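To make the notion of "average correlation between gradients" concrete, here is a minimal Python sketch. It assumes the correlation is measured by the ratio of the summed pairwise inner products $\sum_{i<j} \langle \nabla f_i(x), \nabla f_j(x) \rangle$ to the summed squared norms $\sum_i \|\nabla f_i(x)\|^2$; this formula is not stated on this page and is our reading of the RACOGA condition. The function names `racoga_ratio` and `linreg_grads`, and the least-squares form $f_i(x) = \frac{1}{2}(\langle a_i, x \rangle - b_i)^2$, are hypothetical choices made for illustration.

```python
import numpy as np

def racoga_ratio(grads):
    """Assumed RACOGA-style ratio for an (N, d) array of per-sample gradients:
    sum_{i<j} <g_i, g_j> / sum_i ||g_i||^2."""
    grads = np.asarray(grads, dtype=float)
    sq_norms = np.sum(grads ** 2)           # sum_i ||g_i||^2
    if sq_norms == 0.0:
        raise ValueError("all gradients are zero; the ratio is undefined")
    total = grads.sum(axis=0)               # sum_i g_i
    # ||sum_i g_i||^2 = sum_i ||g_i||^2 + 2 * sum_{i<j} <g_i, g_j>
    cross = 0.5 * (total @ total - sq_norms)
    return cross / sq_norms

def linreg_grads(A, b, x):
    """Per-sample gradients of f_i(x) = 0.5 * (<a_i, x> - b_i)^2 (illustrative model)."""
    residuals = A @ x - b
    return residuals[:, None] * A

# Three hand-checkable cases.
identical = np.tile([1.0, 2.0], (4, 1))          # N = 4 identical gradients
orthogonal = np.eye(3)                           # pairwise orthogonal gradients
opposite = np.array([[1.0, 0.0], [-1.0, 0.0]])   # two gradients that cancel

print(racoga_ratio(identical))   # (N - 1) / 2 = 1.5
print(racoga_ratio(orthogonal))  # 0.0
print(racoga_ratio(opposite))    # -0.5
```

Because $\|\sum_i g_i\|^2 \geq 0$, this ratio can never fall below $-\frac{1}{2}$, which matches the "worst anti-correlation case" mentioned in the caption of Figure 4. Under the same assumption, identical gradients give the largest value, $(N-1)/2$.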

Paper Structure

This paper contains 88 sections, 27 theorems, 223 equations, 15 figures, 2 tables, 11 algorithms.

Key Result

Theorem 1

Under Assumptions ass:interpolation and ass:l_smooth, SGD (Algorithm alg:SGD) is guaranteed to reach an $\varepsilon$-precision (eq:epsilon_pres) at the following iterations:

  • If $f$ is convex, $s = \frac{1}{2L_{(K)}}$,
  • If $f$ is $\mu$-strongly convex, $s = \frac{1}{L_{(K)}}$,
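As a rough illustration of the convex case, the following Python sketch runs SGD with batch size $1$ on a small interpolating least-squares problem. Several choices here are assumptions rather than statements from the theorem: $s$ is read as the step size, $L_{(K)}$ for $K = 1$ is taken to be the largest per-sample smoothness constant $\max_i \|a_i\|^2$, and the data, dimensions, and iteration count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparameterized least squares: N samples, d > N unknowns, noiseless targets,
# so every f_i(x) = 0.5 * (<a_i, x> - b_i)^2 is minimized at x_star (interpolation).
N, d = 20, 50
A = rng.standard_normal((N, d))
x_star = rng.standard_normal(d)
b = A @ x_star

def loss(x):
    """Average of the per-sample losses f_i."""
    return 0.5 * np.mean((A @ x - b) ** 2)

# Assumption: for batch size 1, use the largest per-sample smoothness constant.
L_max = np.max(np.sum(A ** 2, axis=1))
s = 1.0 / (2.0 * L_max)   # step size for the convex case, as read from the theorem

x = np.zeros(d)
initial_loss = loss(x)
initial_dist = np.linalg.norm(x - x_star)

for _ in range(2000):
    i = rng.integers(N)                   # sample one index uniformly
    g = (A[i] @ x - b[i]) * A[i]          # gradient of f_i at x
    x = x - s * g                         # SGD step

final_loss = loss(x)
final_dist = np.linalg.norm(x - x_star)
print(initial_loss, final_loss)
print(initial_dist, final_dist)
```

With this step size, each update cannot increase the distance to an interpolating solution, because every $f_i$ is convex, smooth, and minimized at `x_star`. This is why no decreasing step-size schedule is needed in the sketch.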

Figures (15)

  • Figure 1: Illustration of the convergence speed of GD (Algorithm \ref{alg:GD}), SGD (Algorithm \ref{alg:SGD}, batch size $1$), NAG (Algorithm \ref{alg:NAG}) and SNAG (Algorithm \ref{alg:SNAG}, batch size $1$) on a linear regression problem, together with a histogram of the RACOGA values along the iterations of SNAG. Results for the stochastic algorithms are averaged over ten runs. On the left, the data are generated by a law that makes them weakly correlated, while on the right the data are generated by a Gaussian mixture, leading to higher correlation. The $\lambda$ parameter replaces the unknown SGC constant in the algorithm, see Appendix \ref{app:lin_reg_details}. Note that data correlation results in better performance of SNAG, whereas uncorrelated data lead to smaller RACOGA values, reducing the benefit of using SNAG.
  • Figure 2: Illustration of the convergence speed of GD (Algorithm \ref{alg:GD}), SGD (Algorithm \ref{alg:SGD}, batch size $64$), NAG (Algorithm \ref{alg:SNAG:ML}, full batch) and SNAG (Algorithm \ref{alg:SNAG:ML}, batch size $64$) averaged over $10$ different initializations, together with a histogram of the RACOGA values taken along the optimization path, averaged over $10$ different initializations, where the $x$-axis scale is logarithmic. On the left, we use an MLP to classify data sampled from a law such that they are weakly correlated. On the right, we use a CNN to classify CIFAR10 images. Note that, contrary to Figure \ref{fig:cv_linear_regression}, the presence of correlation within the data no longer influences the RACOGA values, which remain high in both cases, resulting in better performance of SNAG.
  • Figure 3: Illustration of the convergence speed of GD (Algorithm \ref{alg:GD}), NAG (Algorithm \ref{alg:NAG}), SGD (Algorithm \ref{alg:SGD}) and SNAG (Algorithm \ref{alg:SNAG}) with varying batch sizes $K$, applied to a linear regression problem, together with a histogram of the RACOGA values along the iterations of SNAG. Results for the stochastic algorithms are averaged over ten runs. On the left, the data are generated by a law such that they are weakly correlated, while on the right the data are generated by a Gaussian mixture, such that some of the data are highly correlated. Note that the presence of correlation in the data degrades the performance of SNAG (Algorithm \ref{alg:SNAG}) when the batch size is increased too much, whereas with uncorrelated data the performance improves as the batch size increases.
  • Figure 4: Illustration of the convergence speed of GD (Algorithm \ref{alg:GD}), NAG (Algorithm \ref{alg:NAG}), SGD (Algorithm \ref{alg:SGD}) and SNAG (Algorithm \ref{alg:SNAG}) for the linear regression problem with the Boston dataset. The parameters of SNAG are chosen according to Equation \ref{eq:app_lin_reg_param}. Some RACOGA values are high, while others are close to the worst anti-correlation case $-\frac{1}{2}$. Note that SNAG appears to converge faster than the other algorithms when $\lambda$ is chosen small enough.
  • Figure 5: Illustration of the two classification problems we consider in Section \ref{sec:xp_cnn}. On the left, we see a 2D visualization of the CIFAR10 dataset, proposed by balasubramanian2022contrastivelearningoodobject. On the right, we illustrate the SPHERE dataset on the $3d$-sphere, where each hemisphere corresponds to a different label.
  • ...and 10 more figures

Theorems & Definitions (56)

  • Definition 1
  • Definition 2
  • Definition 3
  • Example 1
  • Remark 1
  • Remark 2
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Remark 3
  • ...and 46 more