Towards Continuous-Time Approximations for Stochastic Gradient Descent without Replacement
Stefan Perko
TL;DR
This work presents a continuous-time stochastic framework for SGD without replacement by driving a Young differential equation with epoched Brownian motion to capture epoch-wise data reuse. It proves almost-sure convergence for strongly convex objectives with Hölder Hessians under decaying learning rates $u_t = \frac{1}{(1+ct)^\beta}$ with $\beta\in(0,1)$, and derives explicit rate bounds that are competitive with or improve upon existing SGDo results across various shuffling schemes. The epoched noise model unifies single-shuffle, random reshuffling, and related schemes, linking SGDo dynamics to stochastic-ODE approximations while clarifying the role of epoch structure in convergence. Overall, the paper provides a principled, mathematically tractable lens for analyzing epoch-based SGD methods and their convergence behavior.
Abstract
Gradient optimization algorithms using epochs, that is, those based on stochastic gradient descent without replacement (SGDo), are predominantly used to train machine learning models in practice. However, the mathematical theory of SGDo and related algorithms remains underexplored compared to their "with replacement" and "one-pass" counterparts. In this article, we propose a stochastic, continuous-time approximation to SGDo with additive noise based on a Young differential equation driven by a stochastic process we call an "epoched Brownian motion". We show its usefulness by proving the almost sure convergence of the continuous-time approximation for strongly convex objectives and learning rate schedules of the form $u_t = \frac{1}{(1+t)^\beta}$, $\beta \in (0,1)$. Moreover, we compute an upper bound on the asymptotic rate of almost sure convergence, which is as good as or better than previous results for SGDo.
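To make the discrete algorithm concrete, the following is a minimal sketch of SGDo on a strongly convex least-squares objective with the schedule $u_t = \frac{1}{(1+t)^\beta}$. It illustrates only the epoch-based iteration that the continuous-time model is meant to approximate, not the Young differential equation itself. The least-squares objective, the base step size `h`, the time grid $t_k = kh$, and the helper names `make_problem` and `sgd_without_replacement` are assumptions made for illustration and do not come from the article.

```python
import numpy as np


def learning_rate(t, beta=0.75):
    """Schedule u_t = 1 / (1 + t)^beta with beta in (0, 1)."""
    return 1.0 / (1.0 + t) ** beta


def make_problem(n=50, d=3, noise=0.1, seed=0):
    """Hypothetical test problem: noisy linear regression data and its
    least-squares minimizer w_star."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + noise * rng.normal(size=n)
    w_star = np.linalg.lstsq(X, y, rcond=None)[0]
    return X, y, w_star


def sgd_without_replacement(X, y, epochs=200, beta=0.75, h=0.05,
                            reshuffle=True, seed=1):
    """SGD without replacement on f(w) = (1/2n) * sum_i (x_i . w - y_i)^2.

    reshuffle=True  draws a fresh permutation every epoch (random reshuffling);
    reshuffle=False reuses one permutation for all epochs (single shuffle).
    Assumption: step k is taken at time t_k = k * h with step size h * u_{t_k}.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    perm = rng.permutation(n)
    k = 0
    for _ in range(epochs):
        if reshuffle:
            perm = rng.permutation(n)
        # Each data point is visited exactly once per epoch.
        for i in perm:
            grad = (X[i] @ w - y[i]) * X[i]
            w = w - h * learning_rate(k * h, beta) * grad
            k += 1
    return w


if __name__ == "__main__":
    X, y, w_star = make_problem()
    for reshuffle in (True, False):
        w = sgd_without_replacement(X, y, reshuffle=reshuffle)
        print("reshuffle =", reshuffle,
              "distance to minimizer:", np.linalg.norm(w - w_star))
```

Switching `reshuffle` between `True` and `False` changes only how the permutation is drawn across epochs, which is the kind of shuffling-scheme distinction an epoch-aware noise model has to account for.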
