Enhancing Stochastic Gradient Descent: A Unified Framework and Novel Acceleration Methods for Faster Convergence
Yichuan Deng, Zhao Song, Chiwun Yang
TL;DR
This work addresses the acceleration of stochastic gradient methods in non-convex settings by introducing a unified first-order framework that decomposes the update direction into a gradient term and an acceleration term $v_t$, with adaptive scaling $\eta_t = \frac{2|\langle v_t, \nabla f_t(x_t) \rangle|}{\|v_t\|_2^2}$. It proves a general convergence bound $\min_{t\in[T]} \mathbb{E}[\|\nabla f(x_t)\|_2^2] \le \frac{\sqrt{T+8kB}}{T+2ku_a-2lu_b}\sqrt{2(f(x_0)-f(x^*))L\sigma^2}$ and introduces two plug-and-play methods, Reject Accelerating (RA) and Random Vector Accelerating (RVA), that can tighten the bound by adjusting $k,l,u_a,u_b,B$ or by exploiting Gaussian directions. The authors provide formal lemmas and a main theorem, plus experimental validation on image and language tasks showing that RA and RVA can speed up convergence for several optimizers (with some caveats, such as Adam’s performance). Overall, the framework offers a principled path to faster stochastic optimization in non-convex regimes and suggests practical acceleration strategies with predictable gains.
Abstract
Building on SGD, previous works have proposed many algorithms that improve convergence speed and generalization in stochastic optimization, such as SGDm, AdaGrad, and Adam. However, analyzing their convergence under non-convex conditions is challenging. In this work, we propose a unified framework to address this issue. For any first-order method, we interpret the update direction $g_t$ as the sum of the stochastic subgradient $\nabla f_t(x_t)$ and an additional acceleration term $\frac{2|\langle v_t, \nabla f_t(x_t) \rangle|}{\|v_t\|_2^2} v_t$, so that convergence can be discussed by analyzing $\langle v_t, \nabla f_t(x_t) \rangle$. Through our framework, we have discovered two plug-and-play acceleration methods: \textbf{Reject Accelerating} and \textbf{Random Vector Accelerating}. We theoretically demonstrate that these two methods can directly lead to an improvement in convergence rate.
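
The decomposition stated in the abstract, $g_t = \nabla f_t(x_t) + \frac{2|\langle v_t, \nabla f_t(x_t) \rangle|}{\|v_t\|_2^2} v_t$, can be illustrated with a minimal NumPy sketch. The function names `acceleration_scale` and `update_direction`, the `eps` guard for a zero $v_t$, and the toy vectors are assumptions made for illustration only; they are not taken from the paper, and the sketch does not implement Reject Accelerating or Random Vector Accelerating.

```python
import numpy as np


def acceleration_scale(v, grad, eps=1e-12):
    """Compute eta_t = 2 * |<v_t, grad>| / ||v_t||_2^2.

    Returning 0 for a (near-)zero v_t is an illustrative assumption,
    not something specified in the abstract.
    """
    sq_norm = float(np.dot(v, v))
    if sq_norm < eps:
        return 0.0
    return 2.0 * abs(float(np.dot(v, grad))) / sq_norm


def update_direction(v, grad):
    """Return g_t = grad + eta_t * v_t, the decomposed update direction."""
    return grad + acceleration_scale(v, grad) * v


# Toy example (hypothetical values): a 2-D stochastic gradient and
# an acceleration vector v_t.
grad = np.array([1.0, 2.0])
v = np.array([1.0, 0.0])
g = update_direction(v, grad)
```

Note that the acceleration term $\eta_t v_t$ is unchanged when $v_t$ is multiplied by a positive constant, since $\eta_t$ scales with $1/\|v_t\|_2$ after cancellation; only the direction of $v_t$ and its inner product with the gradient matter.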
