High-dimensional limit theorems for SGD: Momentum and Adaptive Step-sizes

Aukosh Jagannath, Taj Jones-McCormick, Varnan Sarangian

TL;DR

This work develops a rigorous high-dimensional scaling framework for stochastic gradient descent with Polyak momentum (SGD-M) and adaptive step-sizes, establishing diffusion limits for summary statistics and showing equivalence to online SGD under a time-rescaling. By verifying $\delta_n$-localizability and $\delta_n$-closability, the authors derive limiting SDEs with drift $\boldsymbol{h}(\beta,\boldsymbol{u})$ and diffusion $\boldsymbol{\Sigma}(\boldsymbol{u})$, and extend the analysis to adaptive preconditioners. The framework is applied to Spiked Tensor PCA and Single Index Models, where SGD-U (unit-gradient preconditioning) yields fixed points closer to the population minimum and tolerates larger step-sizes, illustrating how early preconditioning stabilizes high-dimensional dynamics compared to online SGD. The results unify fixed- and high-dimensional SGD analyses, provide precise critical thresholds for phase transitions in learning, and offer a rigorous basis for preconditioning strategies that mitigate exploding/vanishing gradient phenomena in high dimensions. Overall, the paper justifies and quantifies the empirical advantage of momentum and gradient normalization in large-scale, high-dimensional learning tasks.

Abstract

We develop a high-dimensional scaling limit for Stochastic Gradient Descent with Polyak Momentum (SGD-M) and adaptive step-sizes. This provides a framework to rigorously compare online SGD with some of its popular variants. We show that the scaling limits of SGD-M coincide with those of online SGD after an appropriate time rescaling and a specific choice of step-size. However, if the step-size is kept the same between the two algorithms, SGD-M will amplify high-dimensional effects, potentially degrading performance relative to online SGD. We demonstrate our framework on two popular learning problems: Spiked Tensor PCA and Single Index Models. In both cases, we also examine online SGD with an adaptive step-size based on normalized gradients. In the high-dimensional regime, this algorithm yields multiple benefits: its dynamics admit fixed points closer to the population minimum, and it widens the range of admissible step-sizes for which the iterates converge to such solutions. These examples provide a rigorous account, aligning with empirical motivation, of how early preconditioners can stabilize and improve dynamics in settings where online SGD fails.
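
The three algorithms compared in the abstract can be sketched as single-step update rules. This is a minimal illustration, not the paper's own parametrization: it assumes the standard heavy-ball form of Polyak momentum and a plain gradient normalization for SGD-U. The function names, the `eps` safeguard, and the noisy quadratic toy loss are hypothetical choices made only for this example; the step-size scaling $\delta = c_\delta / n$ follows the convention used in the figure captions.

```python
import numpy as np


def sgd_step(x, grad, lr):
    """Online SGD: move against the sample gradient with step-size lr."""
    return x - lr * grad


def sgd_m_step(x, x_prev, grad, lr, beta):
    """Heavy-ball (Polyak) momentum step, assumed standard form.

    Returns the new iterate and the current one, which becomes x_prev
    on the next call. With beta = 0 this reduces to sgd_step.
    """
    x_new = x - lr * grad + beta * (x - x_prev)
    return x_new, x


def sgd_u_step(x, grad, lr, eps=1e-12):
    """Unit-gradient step: only the gradient direction is used.

    eps is a hypothetical safeguard against division by zero.
    """
    return x - lr * grad / (np.linalg.norm(grad) + eps)


def noisy_quadratic_grad(x, rng, noise=1.0):
    """Toy sample gradient: gradient of 0.5 * ||x||^2 plus Gaussian noise.

    This loss is an assumption for illustration; it is not one of the
    models studied in the paper.
    """
    return x + noise * rng.standard_normal(x.shape)


def run_demo(n=200, steps=500, c_delta=1.0, beta=0.9, seed=0):
    """Run the three updates side by side and return the final norms."""
    rng = np.random.default_rng(seed)
    lr = c_delta / n  # step-size scaling delta = c_delta / n
    x0 = rng.standard_normal(n) / np.sqrt(n)
    x_sgd, x_u = x0.copy(), x0.copy()
    x_m, x_m_prev = x0.copy(), x0.copy()
    for _ in range(steps):
        x_sgd = sgd_step(x_sgd, noisy_quadratic_grad(x_sgd, rng), lr)
        x_m, x_m_prev = sgd_m_step(
            x_m, x_m_prev, noisy_quadratic_grad(x_m, rng), lr, beta
        )
        x_u = sgd_u_step(x_u, noisy_quadratic_grad(x_u, rng), lr)
    return {
        "sgd": float(np.linalg.norm(x_sgd)),
        "sgd_m": float(np.linalg.norm(x_m)),
        "sgd_u": float(np.linalg.norm(x_u)),
    }
```

Calling `run_demo()` returns the final iterate norms for the three rules on the toy loss. Keeping `lr` identical across the three updates mirrors the same-step-size comparison described in the abstract; the toy loss itself says nothing about the paper's models.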

Paper Structure

This paper contains 30 sections, 32 theorems, 264 equations, 3 figures.

Key Result

Theorem 2.3

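Let $(X_{\ell}^{\delta_n})_{\ell}$ be SGD initialized from $X_0 \sim \mu_n$ for $\mu_n \in \mathscr{M}_1\left(\mathbb{R}^{p_n}\right)$ with learning rate $\delta_n$ and fixed momentum parameter $\beta \in [0, 1)$ for the loss $L_n(\cdot, \cdot)$ and data distribution $P_n$. For a family of summary statistics that is $\delta_n$-localizable and $\delta_n$-closable, the summary statistics evaluated along the iterates converge, in the high-dimensional limit, to the solution of an SDE with drift $\boldsymbol{h}(\beta, \boldsymbol{u})$ and diffusion $\boldsymbol{\Sigma}(\boldsymbol{u})$ initialized from $\nu$, where $\mathbf{B}_t$ is a standard Brownian motion in $\mathbb{R}^k$.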
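
As a point of reference, a diffusion with drift $\boldsymbol{h}$ and diffusion matrix $\boldsymbol{\Sigma}$ driven by a $k$-dimensional Brownian motion is conventionally written in the form below. This is a generic sketch that reuses the symbols from the TL;DR. The square-root convention for the noise coefficient is an assumption, and the precise way $\beta$ and the time rescaling enter should be taken from the paper's statement of Theorem 2.3.

$$
d\boldsymbol{u}_t = \boldsymbol{h}(\beta, \boldsymbol{u}_t)\,dt + \sqrt{\boldsymbol{\Sigma}(\boldsymbol{u}_t)}\,d\mathbf{B}_t, \qquad \boldsymbol{u}_0 \sim \nu .
$$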

Figures (3)

  • Figure 1: Matrix PCA in dimension $n = 10000$ for $\lambda = 0.8$ (left figure), $\lambda = 1.2$ (middle figure), and $\lambda = 2.2$ (right figure) with $c_\delta = 1$. Depicted is the evolution of the statistic $|m(t) / R(t)|$ for $20n$ steps with random initialization. In the left figure, only SGD-U is supercritical. In the middle figure, both SGD-U and online SGD are supercritical; however, the former attains better alignment with the direction vector $v$. In the right figure, only SGD-M with $\beta = 0.9$ is subcritical; the alignment order is nonetheless consistent.
  • Figure 2: Matrix PCA in dimension $n = 10000$ for $\lambda = 0.8$ (left figure), $\lambda = 1.2$ (middle figure), and $\lambda = 2.2$ (right figure) with $c_\delta = 1$. Depicted is the evolution of the rescaled statistic $\tilde{u}_1 = \sqrt{n} \, m(t)$ for $6n$ steps around a fixed window about $m = 0$. We note the increased volatility of the diffusive limits for SGD-M as $\beta$ increases. We see that SGD-U becomes mean repellent for smaller values of $\lambda$, similar to Figure 1.
  • Figure 3: We plot the value of $|m/R|$ over the course of training for independent runs of SGD (full lines) and SGD-U (dashed lines) for various functions $f$ with different amounts of additive noise. We set $n=10,000$ and $\delta = c_\delta/n$, and the total number of steps is one million. We consider $f(x)=x^2$, $f(x) = x^3$ and $f(x)=x^7 + 4x^4$. In each case we choose $c_\delta = 10^{-k}$ for the smallest integer $k$ such that the dynamics are not affected by exploding gradients.

Theorems & Definitions (65)

  • Definition 2.1
  • Definition 2.2
  • Theorem 2.3
  • Remark 2.4
  • Proposition 3.1
  • Proposition 3.2: Fixed Points - SGD-M
  • Proposition 3.3: Fixed Points - SGD-U
  • Proposition 3.4
  • Proposition 3.5
  • Proposition 3.6
  • ...and 55 more