Modified Equations for Stochastic Optimization
Stefan Perko
TL;DR
The thesis extends the theory of stochastic modified equations (SMEs) for stochastic gradient optimization, bridging numerical ODE methods with stochastic analysis to study SGD and its variants. It studies time-inhomogeneous SDEs with step-size expansions, establishing first- and second-order weak approximation properties and explicit linear error terms, then instantiates these results for SGD with linear regression as a key example. A novel diffusion model for SGD without replacement (SGDo) is introduced via epoched Brownian motion (EBM) and Young differential equations, with almost sure convergence in the strongly convex case and convergence-rate bounds at least as sharp as previous results for SGDo; a weak-convergence theory for random walks with permuted shared increments, whose limits are governed by deterministic permutons, shows that EBMs arise as scaling limits. The work further compares first-order SMEs (gradient flow, NCC-SGF, CC-SGF) in linear regression, deriving explicit linear errors that depend on batch size and kurtosis, and confirms the predictions with numerical experiments. Collectively, the results provide a rigorous toolkit for predicting SGD behavior in finite-data regimes and for guiding the choice of SME approximations in practice, with implications for hyperparameter tuning and convergence analysis in large-scale learning systems.
Abstract
In this thesis, we extend the recently introduced theory of stochastic modified equations (SMEs) for stochastic gradient optimization algorithms. In Ch. 3 we study time-inhomogeneous SDEs driven by Brownian motion. For certain SDEs we prove first- and second-order weak approximation properties and, under certain regularity conditions, compute their linear error terms explicitly. In Ch. 4 we instantiate our results for SGD, working out the example of linear regression explicitly. In Ch. 5 we use this example to compare the linear error terms of gradient flow and two commonly used first-order SMEs for SGD. In the second part of the thesis we introduce and study a novel diffusion approximation for SGD without replacement (SGDo) in the finite-data setting. In Ch. 6 we motivate and define the notion of an epoched Brownian motion (EBM). We argue that Young differential equations (YDEs) driven by EBMs serve as continuous-time models for SGDo for any shuffling scheme whose induced permutations converge to a deterministic permuton. Further, we prove almost sure convergence for these YDEs in the strongly convex setting. Moreover, we compute an asymptotic upper bound on the convergence rate that is as sharp as, or better than, previous results for SGDo. In Ch. 7 we study scaling limits of families of random walks that share the same increments up to a random permutation. We show weak convergence under the assumption that the sequence of permutations converges to a deterministic (higher-dimensional) permuton. This permuton determines the covariance function of the limiting Gaussian process. Conversely, we show that every Gaussian process with a covariance function determined by a permuton in this way arises as a weak scaling limit of families of random walks with shared increments. Finally, we apply our weak convergence theory to show that EBMs arise as scaling limits of random walks with finitely many distinct increments.
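As a minimal illustration of the two sampling schemes contrasted above, the following sketch runs SGD with replacement and SGD without replacement (here with random reshuffling as the shuffling scheme) on a synthetic one-dimensional linear regression problem. The data model, step size, epoch count, and all function names are assumptions chosen for illustration and are not taken from the thesis; the sketch shows only the discrete algorithms, not the SME or EBM constructions.

```python
import random


def make_data(n, seed=0):
    """Synthetic 1-D regression data y = 2*x + noise (illustrative assumption)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 0.1 * rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def grad(theta, x, y):
    """Gradient in theta of the per-sample loss 0.5 * (theta * x - y) ** 2."""
    return (theta * x - y) * x


def sgd(xs, ys, eta, epochs, without_replacement, seed=1):
    """Single-sample SGD with constant step size eta; one epoch is n steps."""
    rng = random.Random(seed)
    n = len(xs)
    theta = 0.0
    for _ in range(epochs):
        if without_replacement:
            # SGDo: every index is visited exactly once per epoch,
            # in a freshly shuffled order.
            order = list(range(n))
            rng.shuffle(order)
        else:
            # SGD with replacement: indices are drawn i.i.d. uniformly.
            order = [rng.randrange(n) for _ in range(n)]
        for i in order:
            theta -= eta * grad(theta, xs[i], ys[i])
    return theta


def least_squares(xs, ys):
    """Exact minimizer of the empirical squared loss (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


xs, ys = make_data(200)
theta_star = least_squares(xs, ys)
theta_sgd = sgd(xs, ys, eta=0.01, epochs=50, without_replacement=False)
theta_sgdo = sgd(xs, ys, eta=0.01, epochs=50, without_replacement=True)
print("SGD  error:", abs(theta_sgd - theta_star))
print("SGDo error:", abs(theta_sgdo - theta_star))
```

Both runs use the same number of gradient evaluations, so any difference in the final error comes only from how indices are drawn within each epoch. Other shuffling schemes can be tried by changing how `order` is built in the `without_replacement` branch.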
