Randomised Splitting Methods and Stochastic Gradient Descent

Luke Shaw, Peter A. Whalley

TL;DR

The paper reframes stochastic gradient optimisation with minibatching as a splitting-method problem from ODE theory, revealing why momentum and symmetry-enhanced minibatching improve long-run accuracy. By introducing Symmetric Minibatch Splitting (SMS) and interpreting RR and SMS through Strang-like splittings, it proves that momentum combined with SMS achieves stochastic gradient bias that scales as $O(h^4)$ in mean-square error, outperforming existing strategies. A comprehensive backward-error (Lie derivative and BCH) analysis, augmented by Lyapunov convergence guarantees, shows improved bias and convergence rates for SMS with momentum across strongly convex, smooth objectives, with analytical model problems supporting the rates. The numerical experiments on logistic regression validate the theory and demonstrate practical gains, suggesting that SMS with momentum offers a robust, low-cost route to faster convergence in finite-sum optimisation. The work thus provides a principled, mathematically grounded blueprint for designing minibatching strategies that leverage symmetry and momentum to tighten stochastic bias and speed up optimisation in ML contexts.

Abstract

We explore an explicit link between stochastic gradient descent using common batching strategies and splitting methods for ordinary differential equations. From this perspective, we introduce a new minibatching strategy (called Symmetric Minibatching Strategy) for stochastic gradient optimisation which shows greatly reduced stochastic gradient bias (from $\mathcal{O}(h^2)$ to $\mathcal{O}(h^4)$ in the optimiser stepsize $h$), when combined with momentum-based optimisers. We justify why momentum is needed to obtain the improved performance using the theory of backward analysis for splitting integrators and provide a detailed analytic computation of the stochastic gradient bias on a simple example. Further, we provide improved convergence guarantees for this new minibatching strategy using Lyapunov techniques that show reduced stochastic gradient bias for a fixed stepsize (or learning rate) over the class of strongly-convex and smooth objective functions. Via the same techniques we also improve the known results for the Random Reshuffling strategy for stochastic gradient descent methods with momentum. We argue that this also leads to a faster convergence rate when considering a decreasing stepsize schedule. Both the reduced bias and efficacy of decreasing stepsizes are demonstrated numerically on several motivating examples.
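The batching strategies mentioned above can be illustrated with a minimal sketch. It assumes that the symmetric strategy amounts to sweeping through a random permutation of the batches and then through the same permutation in reverse; the function names (`rr_epoch`, `sms_epoch_pair`, `momentum_sweep`), the toy objective and the heavy-ball update are hypothetical choices made for illustration and are not taken from the paper.

```python
import random


def rr_epoch(num_batches, rng):
    """Random Reshuffling (RR): a fresh random permutation of the batch indices."""
    order = list(range(num_batches))
    rng.shuffle(order)
    return order


def sms_epoch_pair(num_batches, rng):
    """Symmetric sweep (assumed form of SMS): a random permutation of the
    batches followed by the same permutation in reverse, i.e. two epochs."""
    forward = rr_epoch(num_batches, rng)
    return forward + forward[::-1]


def momentum_sweep(grad, order, x, v, h, beta):
    """Classical heavy-ball updates over the batches in `order`.
    This is a generic momentum scheme, not necessarily the paper's integrator."""
    for b in order:
        v = beta * v - h * grad(b, x)
        x = x + v
    return x, v


# Toy finite sum: f_b(x) = 0.5 * (x - a[b])**2, so the minimiser of the
# average objective is mean(a) = 1.5.
a = [0.0, 1.0, 2.0, 3.0]
grad = lambda b, x: x - a[b]

rng = random.Random(0)
x_final, v = 0.0, 0.0
for _ in range(200):  # 200 symmetric sweeps = 400 epochs
    order = sms_epoch_pair(len(a), rng)
    x_final, v = momentum_sweep(grad, order, x_final, v, h=0.01, beta=0.9)

print(f"final iterate {x_final:.4f}, minimiser {sum(a) / len(a):.4f}")
```

Replacing `sms_epoch_pair` with two independent calls to `rr_epoch` gives a Random Reshuffling baseline under the same assumptions, which makes it possible to compare the residual error of the two orderings at a fixed stepsize `h`.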

Paper Structure

This paper contains 34 sections, 8 theorems, 124 equations, 3 figures, 1 algorithm.

Key Result

Theorem 1.4

Let $F$ satisfy Assumption \ref{assum:smoothness} with minimiser $X_{*} \in \mathbb{R}^{d}$. For an initialisation $x_{0} \in \mathbb{R}^{d}$, stepsize $0<h<1/L$ and $K\in\mathbb{N}$, the iterates of \ref{eq:GD} satisfy

Figures (3)

  • Figure 5.1: An experiment for the model problem \ref{eq:ModelProblem} with $\sigma_i$ for $i = 1,\ldots,N$ not constant shows that the bias in the RMSE is no longer $\mathcal{O}(h^{5/2})$ but rather $\mathcal{O}(h^{2})$. We take $N=5$ and set $\boldsymbol{\sigma}^2=[\sigma_1^2,\ldots,\sigma_N^2]$, taking $\boldsymbol{\sigma}^2=[2.5,1.5,0.05,0.15,0.1]$ with $x_i=i$, $i=1,\ldots,5$, and use the Euler method to solve the resulting system \ref{eq:Damped}.
  • Figure 7.1: The error norm $\|x-X_*\|$ is the RMSE (in the Euclidean 2-norm) over 100 independent stochastic gradient realisations. Note that the final batch in each epoch is of a different size to the other batches for all the datasets except SimData, and that no reduction of order of the bias is observed (as would be the case if one had not reweighted the gradients correctly as described in \ref{rem:varRed}).
  • Figure 7.2: $\delta=1/3$ for SimData, $\delta=1/4$ for CTG, $\delta=1/6$ for StatLog and $\delta=1/7$ for Chess.

Theorems & Definitions (30)

  • Definition 1.1
  • Definition 1.2
  • Theorem 1.4
  • proof
  • Theorem 1.6: SGD-RM
  • proof
  • Example 4.1: Gradient Descent
  • Example 4.3: SGD-RM
  • Remark 4.4
  • Example 4.5: SGD-RR
  • ...and 20 more