Towards Continuous-Time Approximations for Stochastic Gradient Descent without Replacement

Stefan Perko

TL;DR

This work presents a continuous-time stochastic framework for SGD without replacement by driving a Young differential equation with epoched Brownian motion to capture epoch-wise data reuse. It proves almost-sure convergence for strongly convex objectives with Hölder Hessians under decaying learning rates $u_t = \frac{1}{(1+ct)^\beta}$ with $\beta\in(0,1)$, and derives explicit rate bounds that are competitive with or improve upon existing SGDo results across various shuffling schemes. The epoched noise model unifies single-shuffle, random reshuffling, and related schemes, linking SGDo dynamics to stochastic-ODE approximations while clarifying the role of epoch structure in convergence. Overall, the paper provides a principled, mathematically tractable lens for analyzing epoch-based SGD methods and their convergence behavior.

Abstract

Gradient optimization algorithms using epochs, that is, those based on stochastic gradient descent without replacement (SGDo), are predominantly used to train machine learning models in practice. However, the mathematical theory of SGDo and related algorithms remains underexplored compared to their "with replacement" and "one-pass" counterparts. In this article, we propose a stochastic, continuous-time approximation to SGDo with additive noise based on a Young differential equation driven by a stochastic process we call an "epoched Brownian motion". We show its usefulness by proving the almost sure convergence of the continuous-time approximation for strongly convex objectives and learning rate schedules of the form $u_t = \frac{1}{(1+t)^\beta}$, $\beta \in (0,1)$. Moreover, we compute an upper bound on the asymptotic rate of almost sure convergence, which is as good as or better than previous results for SGDo.
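
To make the discrete algorithm concrete, the following is a minimal sketch of SGDo with a decaying schedule of the form $\frac{1}{(1+ct)^\beta}$, run on a synthetic least-squares problem. It is not the paper's continuous-time model and does not simulate an epoched Brownian motion. The helper names (`learning_rate`, `make_data`, `sgd_without_replacement`), the base step size `h`, the choice $t = h \cdot \text{step}$, and the test problem are all assumptions made for illustration.

```python
import numpy as np


def learning_rate(t, c=1.0, beta=0.5):
    """Schedule u_t = 1 / (1 + c*t)^beta with beta in (0, 1)."""
    return 1.0 / (1.0 + c * t) ** beta


def make_data(n=200, d=3, noise=0.1, seed=0):
    """Hypothetical synthetic least-squares problem (not from the paper)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + noise * rng.normal(size=n)
    return X, y


def sgd_without_replacement(X, y, epochs=50, h=0.05, c=1.0, beta=0.5,
                            reshuffle=True, seed=0):
    """SGD that visits every sample exactly once per epoch.

    reshuffle=True  draws a fresh permutation each epoch (random reshuffling);
    reshuffle=False reuses one permutation for all epochs (single shuffle).
    The time variable t = h * step is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    perm = rng.permutation(n)  # fixed order used by single shuffle
    step = 0
    for _ in range(epochs):
        if reshuffle:
            perm = rng.permutation(n)
        for i in perm:
            # Gradient of the per-sample loss 0.5 * (x_i . w - y_i)^2
            grad = (X[i] @ w - y[i]) * X[i]
            w = w - h * learning_rate(h * step, c, beta) * grad
            step += 1
    return w
```

Calling `sgd_without_replacement(X, y, reshuffle=False)` on data from `make_data()` gives the single-shuffle variant, while `reshuffle=True` gives random reshuffling; comparing either result with `np.linalg.lstsq(X, y, rcond=None)[0]` shows how close the iterate is to the minimizer of this strongly convex objective.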

Paper Structure

This paper contains 15 sections, 22 theorems, 185 equations.

Key Result

Theorem 2

Let $\beta \in (0,1)$, $c > 0$, and $L, \lambda > 0$, and let $\mathcal{R} \in \mathcal{C}^{2}(\mathbb{R}^d, \mathbb{R})$ be $\lambda$-strongly convex and $L$-smooth such that $\nabla^2 \mathcal{R}$ is Hölder continuous. Let $Y$ be the solution to the Young differential equation driven by an epoched Brownian motion $\hat{W}$ with period $T$. Then

Theorems & Definitions (23)

  • Definition 1
  • Theorem 2
  • Theorem 3
  • Lemma 4
  • Lemma 5: Borell-TIS
  • Lemma 6
  • Lemma 7
  • Lemma 8
  • Proposition 9
  • Proposition 10: Young-Loève
  • ...and 13 more