Enhancing Stochastic Gradient Descent: A Unified Framework and Novel Acceleration Methods for Faster Convergence

Yichuan Deng; Zhao Song; Chiwun Yang

TL;DR

This work addresses accelerating stochastic gradient methods in non-convex settings by introducing a unified first-order framework that decomposes the update direction into a gradient term and an acceleration term $v_t$, with adaptive scaling $\eta_t = \frac{2|\langle v_t, \nabla f_t(x_t) \rangle|}{\|v_t\|_2^2}$. It proves a general convergence bound $\min_{t\in[T]} \mathbb{E}[\|\nabla f(x_t)\|_2^2] \le \frac{\sqrt{T+8kB}}{T+2ku_a-2lu_b}\sqrt{2(f(x_0)-f(x^*))L\sigma^2}$ and introduces two plug-and-play methods, Reject Accelerating and Random Vector Accelerating, that can tighten the bound by adjusting $k,l,u_a,u_b,B$ or by exploiting Gaussian directions. The authors provide formal lemmas and a main theorem, plus experimental validation on image and language tasks showing that RA and RVA can speed up convergence for several optimizers (with some caveats like Adam’s performance). Overall, the framework offers a principled path to faster stochastic optimization in non-convex regimes and suggests practical acceleration strategies with predictable gains.
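To get a feel for how the parameters move the bound quoted above, the right-hand side can be evaluated numerically. The following is only a sketch: the function name `convergence_bound` and the argument name `f_gap` (standing for $f(x_0)-f(x^*)$) are hypothetical, and $k, l, u_a, u_b, B$ are passed through as plain numbers, since their precise definitions live in the paper body rather than in this summary.

```python
import math


def convergence_bound(T, k, l, u_a, u_b, B, f_gap, L, sigma):
    """Evaluate sqrt(T + 8kB) / (T + 2k*u_a - 2l*u_b) * sqrt(2 * f_gap * L * sigma^2).

    f_gap is an assumed name for f(x_0) - f(x^*); the other symbols follow
    the TL;DR formula and are treated here as opaque numeric inputs.
    """
    denom = T + 2 * k * u_a - 2 * l * u_b
    if denom <= 0:
        # The expression is only meaningful as a bound for a positive denominator.
        raise ValueError("T + 2*k*u_a - 2*l*u_b must be positive")
    return math.sqrt(T + 8 * k * B) / denom * math.sqrt(2 * f_gap * L * sigma ** 2)


if __name__ == "__main__":
    # With k = l = 0 the expression reduces to sqrt(2 * f_gap * L * sigma^2 / T).
    print(convergence_bound(T=100, k=0, l=0, u_a=0.0, u_b=0.0, B=0.0,
                            f_gap=1.0, L=2.0, sigma=1.0))
```

Setting $k = l = 0$ removes every acceleration-related term, which makes it a convenient baseline when comparing other parameter choices.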

Abstract

Building on SGD, prior work has proposed many algorithms that improve convergence speed and generalization in stochastic optimization, such as SGDm, AdaGrad, and Adam. However, their convergence analysis under non-convex conditions is challenging. In this work, we propose a unified framework to address this issue. For any first-order method, we interpret the update direction $g_t$ as the sum of the stochastic subgradient $\nabla f_t(x_t)$ and an additional acceleration term $\frac{2|\langle v_t, \nabla f_t(x_t) \rangle|}{\|v_t\|_2^2} v_t$, so that convergence can be discussed by analyzing $\langle v_t, \nabla f_t(x_t) \rangle$. Through our framework, we have discovered two plug-and-play acceleration methods, Reject Accelerating and Random Vector Accelerating, and we theoretically demonstrate that both can directly improve the convergence rate.
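The decomposition described in the abstract can be written out in a few lines of NumPy. This is a minimal sketch rather than the authors' implementation: the function name `accelerated_direction` and the `eps` guard for a near-zero $v_t$ are assumptions added for illustration, and how $v_t$ is chosen (for example by a momentum buffer or a random vector) is left to the caller.

```python
import numpy as np


def accelerated_direction(grad, v, eps=1e-12):
    """Return g_t = grad + eta_t * v with eta_t = 2 * |<v, grad>| / ||v||_2^2.

    grad plays the role of the stochastic (sub)gradient and v the role of
    the acceleration term v_t. The eps guard is an assumption of this
    sketch: when v is (numerically) zero, the plain gradient is returned.
    """
    grad = np.asarray(grad, dtype=float)
    v = np.asarray(v, dtype=float)
    v_norm_sq = float(v @ v)
    if v_norm_sq < eps:
        return grad.copy()
    eta = 2.0 * abs(float(v @ grad)) / v_norm_sq
    return grad + eta * v


if __name__ == "__main__":
    g = accelerated_direction([1.0, 0.0], [1.0, 1.0])
    print(g)  # <v, grad> = 1 and ||v||^2 = 2, so eta = 1 and g = grad + v
```

A descent step would then take the form `x = x - lr * accelerated_direction(grad, v)`, where `lr` is a hypothetical step size not specified in this summary.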

Paper Structure (53 sections, 13 theorems, 54 equations, 2 figures, 1 table, 2 algorithms)

Introduction
Gradient Descent and Stochastic Gradient Descent.
Accelerating the SGD.
A Unified Framework.
Fast-ever SGD with Accelerating.
Related Work
Stochastic Gradient Descent and Applications in Machine Learning.
Stochastic Optimization.
Problem Definition
A Universal Convergence Analysis Framework of Accelerating Algorithms
Consistency between and .
Expectations of and .
In the Case of
In the Case of
Main Results
...and 38 more sections

Key Result

Lemma 4.7

Let $f: \mathbb{R}^d \rightarrow \mathbb{R}$ be an $L$-smooth function with $\sigma$-bounded gradients. Let $T > 0$ be a positive integer, and consider a stochastic iterative first-order optimization algorithm that runs for $T$ iterations. For $t \in [T]$, we follow Definition 4.1 to write it as $x

Figures (2)

Figure 1: Experimental results of applying Reject Accelerating to Adam, Lion, and SGDm optimizers on Cifar-10 dataset.
Figure 2: Experimental results of applying Random Vector Accelerating to Adam and SGD optimizers on Cifar-100 and Penn Treebank datasets separately.

Theorems & Definitions (40)

Definition 3.1
Definition 3.3
Definition 3.4: Convergence rate
Definition 4.1: Additional accelerating term $v_t$
Definition 4.2
Definition 4.3
Definition 4.4
Definition 4.5
Definition 4.6
Lemma 4.7: Informal version of Lemma \ref{lem:case1}
...and 30 more
