SGD with memory: fundamental properties and stochastic acceleration
Dmitry Yarotsky, Maksim Velikanov
TL;DR
This work develops a general theoretical framework for SGD variants with fixed memory, showing that stationary memory-$M$ methods preserve the basic $t^{-\zeta}$ loss decay of plain GD while allowing the asymptotic constant to be tuned via an effective learning rate. By deriving a propagator-based loss expansion, the authors decompose the dynamics into signal and noise contributions and prove stability conditions, yielding a power-law phase diagram that extends to arbitrary memory sizes. Focusing on memory-1 algorithms, they demonstrate that the effective learning rate can be made arbitrarily large without violating the stability constraints, enabling accelerated convergence in the signal-dominated regime. They further propose a time-dependent memory-1 schedule (AM1) for which a heuristic argument gives $L_t=O(t^{-\zeta(2-1/\nu)})$, and they report empirical improvements on synthetic and MNIST-like tasks. The results suggest a practical pathway to accelerating mini-batch SGD on quadratics with power-law spectra, with potential implications for kernel methods and wide neural networks in the NTK regime.
Abstract
An important open problem is the theoretically feasible acceleration of mini-batch SGD-type algorithms on quadratic problems with power-law spectrum. In the non-stochastic setting, the optimal exponent $\xi$ in the loss convergence $L_t\sim C_L t^{-\xi}$ is double that of plain GD and is achievable using Heavy Ball (HB) with a suitable schedule; however, this approach no longer works in the presence of mini-batch noise. We address this challenge by considering first-order methods with an arbitrary fixed number $M$ of auxiliary velocity vectors (*memory-$M$ algorithms*). We first prove an equivalence between two forms of such algorithms and describe them in terms of suitable characteristic polynomials. Then we develop a general expansion of the loss in terms of signal and noise propagators. Using it, we show that losses of stationary stable memory-$M$ algorithms always retain the exponent $\xi$ of plain GD, but can have different constants $C_L$ depending on their effective learning rate, which generalizes that of HB. We prove that in memory-1 algorithms we can make $C_L$ arbitrarily small while maintaining stability. As a consequence, we propose a memory-1 algorithm with a time-dependent schedule that we show heuristically and experimentally to improve the exponent $\xi$ of plain SGD.
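To make the setting concrete, the following is a minimal sketch of the noise-free case only: plain GD versus the textbook constant-parameter Heavy Ball update (one velocity vector, so a memory-1 method) on a diagonal quadratic with power-law eigenvalues $\lambda_k = k^{-\nu}$. It is not the authors' time-dependent schedule and it omits mini-batch noise entirely. The function names, the problem size, the choice $\nu = 1.5$, the unit initial error, and the values of `alpha` and `beta` are all illustrative assumptions. The quantity `alpha / (1 - beta)` is the commonly used effective learning rate of HB.

```python
# Sketch: plain GD vs. constant-parameter Heavy Ball on a diagonal quadratic
# L(w) = 0.5 * sum_k lam_k * w_k^2 with power-law eigenvalues lam_k = k^(-nu).
# Deterministic (full-gradient) setting; all parameter values are assumptions.


def power_law_eigenvalues(n, nu):
    """Eigenvalues lam_k = k^(-nu) for k = 1..n."""
    return [k ** (-nu) for k in range(1, n + 1)]


def loss(eigs, w):
    """Quadratic loss in the eigenbasis; w holds the per-mode errors."""
    return 0.5 * sum(lam * wk * wk for lam, wk in zip(eigs, w))


def run_gd(eigs, w0, alpha, steps):
    """Plain gradient descent: w <- w - alpha * grad, with grad_k = lam_k * w_k."""
    w = list(w0)
    for _ in range(steps):
        w = [wk - alpha * lam * wk for lam, wk in zip(eigs, w)]
    return loss(eigs, w)


def run_heavy_ball(eigs, w0, alpha, beta, steps):
    """Heavy Ball with one velocity vector: v <- beta*v - alpha*grad, w <- w + v."""
    w = list(w0)
    v = [0.0] * len(w)
    for _ in range(steps):
        v = [beta * vk - alpha * lam * wk for lam, wk, vk in zip(eigs, w, v)]
        w = [wk + vk for wk, vk in zip(w, v)]
    return loss(eigs, w)


eigs = power_law_eigenvalues(n=200, nu=1.5)
w0 = [1.0] * len(eigs)  # assumed initial error: one unit in every mode

loss0 = loss(eigs, w0)
loss_gd = run_gd(eigs, w0, alpha=1.0, steps=500)
loss_hb = run_heavy_ball(eigs, w0, alpha=1.0, beta=0.9, steps=500)

print(f"initial loss: {loss0:.3e}")
print(f"GD loss:      {loss_gd:.3e}")
print(f"HB loss:      {loss_hb:.3e}  (effective lr = {1.0 / (1 - 0.9):.1f})")
```

With these settings the slowly decaying small-eigenvalue modes contract faster under HB than under GD, which is the deterministic effect of a larger effective learning rate. In the mini-batch setting discussed in the abstract, the noise contribution to the loss also has to be taken into account, and this sketch does not model it.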
