
SGD with memory: fundamental properties and stochastic acceleration

Dmitry Yarotsky, Maksim Velikanov

TL;DR

This work develops a general theoretical framework for SGD variants with a fixed number $M$ of auxiliary memory vectors. It shows that stationary, stable memory-$M$ methods retain the $t^{-\zeta}$ loss decay of plain GD, while the asymptotic constant can be tuned through an effective learning rate. Using a propagator-based loss expansion, the authors decompose the dynamics into signal and noise contributions and prove stability conditions, which extends the power-law phase diagram of SGD with momentum to arbitrary memory sizes. For memory-1 algorithms they show that the effective learning rate can be made arbitrarily large while maintaining stability, which shrinks the loss constant in the signal-dominated regime. They further propose a time-dependent schedule (AM1) that heuristically achieves $L_t=O(t^{-\zeta(2-1/\nu)})$ and improves on plain SGD in experiments on synthetic Gaussian data and MNIST classification. The results offer a pathway to accelerating mini-batch SGD on quadratics with power-law spectra, with potential implications for kernel methods and wide neural networks in the NTK regime.
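As a worked check, the exponent quoted above can be related to two statements in the figure captions on this page: the AM1 rate $t^{-\zeta(1+\overline{\alpha})}$ (Figures 1 and 4) and the stability condition $\overline{\alpha}\le\overline{\delta}(1-\tfrac{1}{\nu})$ (Figure 4). The sketch below additionally assumes $\overline{\delta}\le 1$, which is not stated on this page; it is suggested by the Figure 5 schedule $\beta_t=1-\tfrac{2}{t}$, which would correspond to $\overline{\delta}=1$ if $\delta_t=1-\beta_t$.

$$
\overline{\alpha}\;\le\;\overline{\delta}\Big(1-\frac{1}{\nu}\Big)\;\le\;1-\frac{1}{\nu}
\quad\Longrightarrow\quad
\zeta\,(1+\overline{\alpha})\;\le\;\zeta\Big(2-\frac{1}{\nu}\Big).
$$

Under these assumptions the largest stable schedule exponent gives exactly the rate $O(t^{-\zeta(2-1/\nu)})$. For instance, $\nu=2$ would allow $\overline{\alpha}\le\tfrac12$ and a loss exponent of at most $\tfrac{3}{2}\zeta$, compared with $\zeta$ for plain SGD.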

Abstract

An important open problem is the theoretically feasible acceleration of mini-batch SGD-type algorithms on quadratic problems with power-law spectrum. In the non-stochastic setting, the optimal exponent $\xi$ in the loss convergence $L_t\sim C_Lt^{-\xi}$ is double that in plain GD and is achievable using Heavy Ball (HB) with a suitable schedule; this no longer works in the presence of mini-batch noise. We address this challenge by considering first-order methods with an arbitrary fixed number $M$ of auxiliary velocity vectors (*memory-$M$ algorithms*). We first prove an equivalence between two forms of such algorithms and describe them in terms of suitable characteristic polynomials. Then we develop a general expansion of the loss in terms of signal and noise propagators. Using it, we show that losses of stationary stable memory-$M$ algorithms always retain the exponent $\xi$ of plain GD, but can have different constants $C_L$ depending on their effective learning rate that generalizes that of HB. We prove that in memory-1 algorithms we can make $C_L$ arbitrarily small while maintaining stability. As a consequence, we propose a memory-1 algorithm with a time-dependent schedule that we show heuristically and experimentally to improve the exponent $\xi$ of plain SGD.
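To make the notion of an effective learning rate concrete, here is a minimal sketch on a single slow eigenmode of a quadratic, without mini-batch noise. It assumes the textbook velocity form of Heavy Ball and the standard identity $\alpha_{\mathrm{eff}}=\alpha/(1-\beta)$; the paper's own parametrization of memory-$M$ algorithms may differ. The function names and the numerical values are illustrative only.

```python
def gd_final(lam, alpha, steps):
    """Plain GD on L(w) = lam * w**2 / 2, starting from w = 1."""
    w = 1.0
    for _ in range(steps):
        w -= alpha * lam * w
    return w


def heavy_ball_final(lam, alpha, beta, steps):
    """Heavy Ball in velocity form (assumed textbook parametrization)."""
    w, v = 1.0, 0.0
    for _ in range(steps):
        v = beta * v - alpha * lam * w  # velocity accumulates past gradients
        w += v
    return w


# Illustrative values: one slow mode, constant (stationary) parameters.
lam, alpha, beta, steps = 1e-4, 0.1, 0.9, 2000
alpha_eff = alpha / (1.0 - beta)  # assumed effective learning rate of HB

w_gd = gd_final(lam, alpha, steps)          # plain GD with alpha
w_hb = heavy_ball_final(lam, alpha, beta, steps)
w_gd_eff = gd_final(lam, alpha_eff, steps)  # plain GD with alpha_eff
print(w_gd, w_hb, w_gd_eff)
```

On this slow mode the Heavy Ball iterate ends close to the plain GD iterate run with `alpha_eff`, and well ahead of plain GD run with `alpha`. This matches the abstract's statement that stationary methods change the constant $C_L$ through the effective learning rate rather than changing the exponent.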

Paper Structure (58 sections, 15 theorems, 167 equations, 5 figures)


Key Result

Theorem 1

Suppose that the loss is quadratic as in Eq. \ref{eq:H}.

Figures (5)

  • Figure 1: Divergence of Jacobi accelerated HB vs. stability of our accelerated memory-1 (AM1) method. For both synthetic Gaussian data (left) and MNIST classification with a shallow ReLU network (right), Jacobi HB enjoys the accelerated rate $L_t=O(t^{-2\zeta})$ in the full-batch setting but eventually diverges in the stochastic one. Increasing the batch size only delays the loss explosion. In contrast, our AM1 algorithm, although its acceleration is weaker than that of full-batch Jacobi HB, remains stable for small batch sizes and long training durations. For Gaussian data with an ideal power-law spectrum, dashed lines show theoretical loss power laws with the predicted exponents ($t^{-\zeta}$ for SGD and $t^{-\zeta(1+\overline{\alpha})}$ for AM1), which asymptotically match the experimental loss trajectories. Here $\overline{\alpha}>0$ is the schedule exponent of the effective learning rate $\alpha_{\mathrm{eff},t}\approx t^{\overline{\alpha}}$ used by the AM1 algorithm. See sec. \ref{sec:mem1} for the precise definition of the AM1 schedule and its parameters, and sec. \ref{sec:experiments} for figure details and extra experiments.
  • Figure 2: Left: The phase diagram of stationary SGD with momentum from \cite{velikanovview}. Theorems \ref{th:lexp}, \ref{th:powasymp} show that it remains valid for general memory size $M$. Right: Leading terms in the loss expansion (\ref{eq:lossexp}). In the signal-dominated phase the leading terms have one long signal propagator $V'$ and many short noise propagators $U'$. In the noise-dominated phase the leading terms have one long noise propagator $U'$, while the other propagators $U'$ and the signal propagators $V'$ are short.
  • Figure 3: The geometric position and movement of complex eigenvalues of a memory-1 matrix $S_\lambda$ as $\lambda$ increases from 0 to 1 (see Theorem \ref{th:m1basic}, part 2). Left: Classical Heavy Ball ($q_0=0$, \cite{polyak1964some, tugay1989properties}) with $\delta=0.25, \alpha_{\mathrm{eff}}=14$. The circle of non-real eigenvalues is centered at the origin and has radius $r=\sqrt{1-\delta}$. Acceleration ($\alpha_{\mathrm{eff}}\gg 1$) requires $\delta\ll 1$ and hence a large circle, $r\approx 1$. Right: Our generalized memory-1 algorithm with $\delta=0.15, \alpha_{\mathrm{eff}}=4, q_0=1.3$. The accelerated regime of Theorem \ref{th:acceelrated_stability} corresponds to small circles close to point 1.
  • Figure 4: Comparison of plain GD and AM1 on a grid of $\overline{\delta},\overline{\alpha}$. We run both algorithms in the full-batch setting and in the mini-batch setting with $|B|=1$, giving 4 trajectories on each subplot. We choose $\alpha=0.1$ for both algorithms and $\alpha_{\mathrm{eff},t}=0.1 t^{\overline{\alpha}}$ to avoid any non-asymptotic divergence. For each pair $\overline{\delta},\overline{\alpha}$, the subplot title is colored green if the stability condition $\overline{\alpha}\le\overline{\delta}(1-\tfrac{1}{\nu})$ holds and red otherwise. All empirically diverged trajectories are predicted correctly by this stability condition, while a single trajectory, $(\overline{\delta},\overline{\alpha})=(1.0,0.75)$, is expected to diverge but demonstrates only large but bounded noise fluctuations. Finally, along each experimental trajectory we plot an exact power-law (dashed) line with rate $t^{-\zeta}$ for GD and $t^{-\zeta(1+\overline{\alpha})}$ for AM1. Again, we observe very good agreement between the theoretical predictions and the empirical rates, validating the adiabatic approximation of sec. \ref{sec:pl_alg_convergence}. Moreover, the rate $t^{-\zeta(1+\overline{\alpha})}$ remains valid even for the case $\overline{\delta}=1$, which is not covered by the adiabatic approximation.
  • Figure 5: Trajectories of plain GD, Jacobi-scheduled HB and AM1 for different learning rates and batch sizes. For each subplot, the learning rate $\alpha$ of GD and Jacobi-scheduled HB is equal to the "kick" learning rate $\alpha_1$ of AM1. Also, Jacobi-scheduled HB and AM1 have exactly the same momentum values $\beta_t=1-\tfrac{2}{t}$ at each iteration. We observe that AM1 is indeed much more stable than Jacobi-scheduled HB for a range of learning rates and batch sizes. Moreover, it is stable for larger values of $\overline{\alpha}$ than the estimated maximal value $\overline{\alpha}_\mathrm{max}\approx0.25$. Another interesting observation is that the divergent trajectories of Jacobi-scheduled HB, unlike for pure quadratic problems, do not diverge to infinity but get stuck around the loss of random prediction.
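The geometry described in the Figure 3 caption (left panel) can be reproduced with a short sketch. It assumes the textbook Heavy Ball recursion $w_{t+1}=w_t-\alpha\lambda w_t+\beta(w_t-w_{t-1})$ on an eigenmode $\lambda$, whose characteristic polynomial is $x^2-(1+\beta-\alpha\lambda)x+\beta$, together with the identifications $\beta=1-\delta$ and $\alpha_{\mathrm{eff}}=\alpha/(1-\beta)$. These identifications are assumptions made for illustration; the paper's matrix $S_\lambda$ may be parametrized differently. The helper name `heavy_ball_roots` is hypothetical.

```python
import cmath
import math


def heavy_ball_roots(lam, alpha, beta):
    """Roots of x^2 - (1 + beta - alpha*lam) x + beta = 0."""
    b = 1.0 + beta - alpha * lam
    disc = cmath.sqrt(b * b - 4.0 * beta)
    return (b + disc) / 2.0, (b - disc) / 2.0


delta, alpha_eff = 0.25, 14.0  # values quoted in the Figure 3 caption (left)
beta = 1.0 - delta             # assumption: momentum beta = 1 - delta
alpha = alpha_eff * delta      # assumption: alpha_eff = alpha / (1 - beta)

# Sweep lambda from 0 to 1 and report whether the roots are real or complex.
for k in range(11):
    lam = k / 10.0
    x1, x2 = heavy_ball_roots(lam, alpha, beta)
    kind = "complex" if abs(x1.imag) > 1e-12 else "real"
    print(f"lam={lam:.1f}  {kind:7s}  |x1|={abs(x1):.4f}  |x2|={abs(x2):.4f}")
```

Under these assumptions every non-real root has modulus $\sqrt{\beta}=\sqrt{1-\delta}$, since the product of a complex-conjugate pair equals the constant term $\beta$. The roots are real at the two ends of the sweep, with a root at $1$ for $\lambda=0$ and a root at $-1$ for $\lambda=1$, so the quoted values sit exactly on the stability boundary of this recursion.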

Theorems & Definitions (21)

  • Theorem 1: \ref{sec:th:equiv:proof}
  • Proposition 1: \ref{th:equivnonstatproof}
  • Proposition 2: \ref{sec:se}
  • Theorem 2: \ref{sec:lossexpproof}
  • Proposition 3
  • Theorem 3: \ref{sec:lexpproof}
  • Theorem 4: \ref{th:stabproof}
  • Theorem 5: \ref{th:powasympproof}
  • Theorem 6: \ref{th:m1basicproof}
  • Theorem 7: \ref{sec:accelerated_stability_proof}
  • ...and 11 more