AdaGrad-Diff: A New Version of the Adaptive Gradient Algorithm

Matia Bojovic, Saverio Salzo, Massimiliano Pontil

TL;DR

An AdaGrad-style adaptive method in which the adaptation is driven by the cumulative squared norms of successive gradient differences rather than by the gradient norms themselves. In numerical experiments, the method is more robust than AdaGrad in several practically relevant settings.

Abstract

Vanilla gradient methods are often highly sensitive to the choice of stepsize, which typically requires manual tuning. Adaptive methods alleviate this issue and have therefore become widely used. Among them, AdaGrad has been particularly influential. In this paper, we propose an AdaGrad-style adaptive method in which the adaptation is driven by the cumulative squared norms of successive gradient differences rather than gradient norms themselves. The key idea is that when gradients vary little across iterations, the stepsize is not unnecessarily reduced, while significant gradient fluctuations, reflecting curvature or instability, lead to automatic stepsize damping. Numerical experiments demonstrate that the proposed method is more robust than AdaGrad in several practically relevant settings.
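
The adaptation rule described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the summary does not give the exact update, so the scalar (norm-based) stepsize, the zero "previous gradient" before the first step, the `eps` offset, and the absence of any proximal step are all assumptions made here for illustration. The function name `adagrad_diff_sketch` is hypothetical.

```python
import numpy as np

def adagrad_diff_sketch(grad, x0, eta=1.0, eps=1e-8, n_iters=200):
    """Hypothetical sketch of a difference-driven AdaGrad-style method.

    The stepsize is eta / sqrt(eps + sum_k ||g_k - g_{k-1}||^2).
    Assumptions (not taken from the paper): scalar stepsize,
    g_0 = 0 before the first iteration, no proximal/constraint step.
    """
    x = np.asarray(x0, dtype=float).copy()
    g_prev = np.zeros_like(x)      # assumed convention for the first difference
    acc = eps                      # running sum of squared gradient differences
    stepsizes = []
    for _ in range(n_iters):
        g = np.asarray(grad(x), dtype=float)
        acc += float(np.sum((g - g_prev) ** 2))
        step = eta / np.sqrt(acc)
        x = x - step * g           # plain gradient step with the adaptive stepsize
        g_prev = g
        stepsizes.append(step)
    return x, np.array(stepsizes)

# Quadratic f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x_final, steps = adagrad_diff_sketch(lambda x: x, x0=[1.0, -2.0])

# Constant gradient: successive differences vanish after the first step,
# so the accumulator (and hence the stepsize) stops changing.
_, const_steps = adagrad_diff_sketch(lambda x: np.array([3.0, 4.0]),
                                     x0=[0.0, 0.0], n_iters=20)
```

In the constant-gradient case the stepsize stays fixed after the first iteration, which mirrors the abstract's point that the stepsize is not reduced when gradients vary little. A standard AdaGrad-norm accumulator (summing squared gradient norms instead) would keep shrinking the stepsize in that same case.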

Paper Structure

This paper contains 19 sections, 11 theorems, 81 equations, 7 figures, 1 algorithm.

Key Result

Theorem 2.4

Under Assumptions ass:0 and ass:1, let $(\bm{x}^n)_{n\in\mathbb{N}}$ be the sequence generated by Algorithm adadiff and suppose that $(\bm{x}^n)_{n\in\mathbb{N}}$ is bounded (this can be ensured if $\mathrm{dom}\,\varphi$ is bounded). Define, for every $n\in \mathbb{N}$, $\bar{\bm{x}}^n = (n-1)^{-1}\s

Figures (7)

  • Figure 1: On the first row, final optimality gaps after $n$ iterations across different choices for the parameter $\eta$, illustrating the sensitivity of the two methods in the setting with nonsmooth losses. On the second row, the minimization performance of AdaGrad and AdaGrad-Diff with optimally tuned stepsizes. The plots show the average and standard deviation over 10 initializations of the algorithms.
  • Figure 2: On the first row, final optimality gaps after $n$ iterations across different choices for the parameter $\eta$, illustrating the sensitivity of the two methods in the setting with smooth losses. On the second row, the minimization performance of AdaGrad and AdaGrad-Diff with optimally tuned stepsizes. The plots show the average and standard deviation over 10 initializations of the algorithms.
  • Figure 3: Comparison of AdaGrad and AdaGrad-Diff for Logistic Regression on the splice.t dataset. (a) Optimality gaps across three stepsize parameter settings ($\eta = 0.0238,\, 0.238,\, 2.38$). (b) Stepsize evolution across different choices for the parameter $\eta$. The plots show the average and standard deviation over 10 initializations of the algorithms.
  • Figure 4: Comparison of AdaGrad and AdaGrad-Diff for the Hinge Loss on the synthetic dataset. (a) Optimality gaps across three stepsize parameter settings ($\eta = 0.0063,\, 0.063,\, 0.63$). (b) Stepsize evolution across different choices for the parameter $\eta$. The plots show the average and standard deviation over 10 initializations of the algorithms.
  • Figure 5: Comparison of AdaGrad and AdaGrad-Diff for LAD Regression on the synthetic dataset. (a) Optimality gaps across three stepsize parameter settings ($\eta = 0.0042,\, 0.042,\, 0.42$). (b) Stepsize evolution across different choices for the parameter $\eta$. The plots show the average and standard deviation over 10 initializations of the algorithms.
  • ...and 2 more figures

Theorems & Definitions (23)

  • Theorem 2.4
  • Theorem 2.5
  • Lemma 3.1: The basic inequalities
  • Corollary 3.2
  • Proposition 3.3
  • Proof of Theorem \ref{thm_nonsmooth}
  • Proposition 3.4
  • Proposition 3.5
  • Proposition 3.6
  • Proof of Theorem \ref{thm_smooth}
  • ...and 13 more