Enhancing Stochastic Gradient Descent: A Unified Framework and Novel Acceleration Methods for Faster Convergence

Yichuan Deng, Zhao Song, Chiwun Yang

TL;DR

This work addresses the acceleration of stochastic gradient methods in non-convex settings by introducing a unified first-order framework that decomposes the update direction into a gradient term and an acceleration term $v_t$, with adaptive scaling $\eta_t = \frac{2|\langle v_t, \nabla f_t(x_t) \rangle|}{\|v_t\|_2^2}$. It proves a general convergence bound $\min_{t\in[T]} \mathbb{E}[\|\nabla f(x_t)\|_2^2] \le \frac{\sqrt{T+8kB}}{T+2ku_a-2lu_b}\sqrt{2(f(x_0)-f(x^*))L\sigma^2}$ and introduces two plug-and-play methods, Reject Accelerating (RA) and Random Vector Accelerating (RVA), that can tighten the bound by adjusting $k,l,u_a,u_b,B$ or by exploiting Gaussian directions. The authors provide formal lemmas and a main theorem, plus experimental validation on image and language tasks showing that RA and RVA can speed up convergence for several optimizers (with some caveats, such as Adam’s performance). Overall, the framework offers a principled path to faster stochastic optimization in non-convex regimes and suggests practical acceleration strategies whose gains can be read off the bound.
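To get a feel for how the quantities in the bound interact, the right-hand side can be evaluated numerically. The sketch below is only an illustration: the function name `convergence_bound`, the argument `f_gap` (standing for $f(x_0)-f(x^*)$), and all numeric values are assumptions chosen for the example, and $k,l,u_a,u_b,B$ are treated as plain numbers whose definitions live in the paper.

```python
import math

def convergence_bound(T, k, l, u_a, u_b, B, f_gap, L, sigma):
    """Evaluate sqrt(T + 8kB) / (T + 2k*u_a - 2l*u_b) * sqrt(2 * f_gap * L * sigma^2).

    f_gap is a hypothetical name for f(x_0) - f(x^*).
    """
    denom = T + 2 * k * u_a - 2 * l * u_b
    if denom <= 0:
        raise ValueError("denominator must be positive for the bound to be meaningful")
    return math.sqrt(T + 8 * k * B) / denom * math.sqrt(2 * f_gap * L * sigma ** 2)

# With k = l = 0 the expression collapses to sqrt(2 * f_gap * L * sigma^2 / T).
baseline = convergence_bound(T=10_000, k=0, l=0, u_a=0.0, u_b=0.0, B=0.0,
                             f_gap=1.0, L=1.0, sigma=1.0)

# Arbitrary illustrative values (not taken from the paper) for which the
# numerator grows more slowly than the denominator, so the bound shrinks.
tuned = convergence_bound(T=10_000, k=1_000, l=0, u_a=1.0, u_b=0.0, B=0.1,
                          f_gap=1.0, L=1.0, sigma=1.0)

print(baseline, tuned, tuned < baseline)
```

Setting $k = l = 0$ removes every acceleration-related quantity, leaving $\sqrt{2(f(x_0)-f(x^*))L\sigma^2/T}$, which decays as $1/\sqrt{T}$. Whether a given choice of the remaining parameters is attainable depends on conditions stated in the paper, not on this sketch.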

Abstract

Building on SGD, previous works have proposed many algorithms that improve convergence speed and generalization in stochastic optimization, such as SGDm, AdaGrad, and Adam. However, their convergence analysis under non-convex conditions is challenging. In this work, we propose a unified framework to address this issue. For any first-order method, we interpret the update direction $g_t$ as the sum of the stochastic subgradient $\nabla f_t(x_t)$ and an additional acceleration term $\frac{2|\langle v_t, \nabla f_t(x_t) \rangle|}{\|v_t\|_2^2} v_t$, so that convergence can be discussed by analyzing $\langle v_t, \nabla f_t(x_t) \rangle$. Through our framework, we have discovered two plug-and-play acceleration methods, Reject Accelerating and Random Vector Accelerating, and we theoretically demonstrate that both can directly improve the convergence rate.
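The decomposition described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration of the formula only, not the authors' implementation: the helper names `dot`, `acceleration_term`, and `update_direction` are hypothetical, and returning a zero term when $v_t = 0$ is an assumption made here to avoid dividing by zero.

```python
def dot(a, b):
    """Euclidean inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def acceleration_term(v, grad):
    """Return eta * v with eta = 2 * |<v, grad>| / ||v||_2^2."""
    v_sq = dot(v, v)
    if v_sq == 0.0:
        # Assumption for this sketch: a zero v contributes no acceleration.
        return [0.0] * len(v)
    eta = 2.0 * abs(dot(v, grad)) / v_sq
    return [eta * x for x in v]

def update_direction(v, grad):
    """g = grad + eta * v, the decomposition quoted in the abstract."""
    return [g + a for g, a in zip(grad, acceleration_term(v, grad))]

grad = [1.0, 2.0]   # stands in for the stochastic gradient at x_t
v = [2.0, 0.0]      # stands in for the acceleration vector v_t

g = update_direction(v, grad)
print(g)  # [3.0, 2.0]

# Rescaling v by a positive constant leaves the acceleration term unchanged.
print(acceleration_term([4.0, 0.0], grad))  # [2.0, 0.0]
```

Because the scaling divides by $\|v_t\|_2^2$, the term $\eta_t v_t$ depends only on the direction of $v_t$ (for positive rescalings), and by the Cauchy-Schwarz inequality its norm is at most $2\|\nabla f_t(x_t)\|_2$.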

Paper Structure

This paper contains 53 sections, 13 theorems, 54 equations, 2 figures, 1 table, 2 algorithms.

Key Result

Lemma 4.7

Let $f: \mathbb{R}^d \rightarrow \mathbb{R}$ be an $L$-smooth function with $\sigma$-bounded gradients. Let $T > 0$ be a positive integer, and consider a stochastic iterative first-order optimization algorithm that runs for $T$ iterations. For $t \in [T]$, we follow Definition 4.1 to write it as $x \dots$

Figures (2)

  • Figure 1: Experimental results of applying Reject Accelerating to the Adam, Lion, and SGDm optimizers on the CIFAR-10 dataset.
  • Figure 2: Experimental results of applying Random Vector Accelerating to the Adam and SGD optimizers on the CIFAR-100 and Penn Treebank datasets, respectively.

Theorems & Definitions (40)

  • Definition 3.1
  • Definition 3.3
  • Definition 3.4: Convergence rate
  • Definition 4.1: Additional accelerating term $v_t$
  • Definition 4.2
  • Definition 4.3
  • Definition 4.4
  • Definition 4.5
  • Definition 4.6
  • Lemma 4.7: Informal version of Lemma \ref{lem:case1}
  • ...and 30 more