High-dimensional limit theorems for SGD: Momentum and Adaptive Step-sizes
Aukosh Jagannath, Taj Jones-McCormick, Varnan Sarangian
TL;DR
This work develops a rigorous high-dimensional scaling framework for stochastic gradient descent with Polyak momentum (SGD-M) and adaptive step-sizes, establishing diffusion limits for summary statistics and showing that SGD-M is equivalent to online SGD under a time rescaling and a matching choice of step-size. By verifying $\delta_n$-localizability and $\delta_n$-closability, the authors derive limiting SDEs with drift $\boldsymbol{h}(\beta,\boldsymbol{u})$ and diffusion $\boldsymbol{\Sigma}(\boldsymbol{u})$, and extend the analysis to adaptive preconditioners. The framework is applied to Spiked Tensor PCA and Single Index Models, where SGD-U (online SGD with unit-normalized gradients) yields fixed points closer to the population minimum and tolerates a wider range of step-sizes, illustrating how early preconditioning stabilizes high-dimensional dynamics compared to online SGD. The results unify fixed- and high-dimensional SGD analyses, provide precise critical thresholds for phase transitions in learning, and offer a rigorous basis for preconditioning strategies that mitigate exploding/vanishing gradient phenomena in high dimensions. Overall, the paper shows that momentum matches online SGD only after the step-size is rescaled, and that it can amplify high-dimensional effects otherwise, while giving a rigorous account of the empirical advantage of gradient normalization in large-scale, high-dimensional learning tasks.
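For orientation, the summary above names only a drift and a diffusion term. Assuming the limit takes the usual Itô form for effective dynamics of this kind (the exact form, the square root of the diffusion matrix, and the Brownian motion $\boldsymbol{B}_t$ are assumptions introduced here, not quoted from the paper), the limiting SDE for the summary statistics $\boldsymbol{u}_t$ would read:

$$
\mathrm{d}\boldsymbol{u}_t = \boldsymbol{h}(\beta,\boldsymbol{u}_t)\,\mathrm{d}t + \sqrt{\boldsymbol{\Sigma}(\boldsymbol{u}_t)}\,\mathrm{d}\boldsymbol{B}_t
$$

Here $\beta$ is presumably the momentum parameter, so the drift changes with the amount of momentum, while the diffusion term is written without it, as in the summary.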
Abstract
We develop a high-dimensional scaling limit for Stochastic Gradient Descent with Polyak Momentum (SGD-M) and adaptive step-sizes. This provides a framework to rigorously compare online SGD with some of its popular variants. We show that the scaling limits of SGD-M coincide with those of online SGD after an appropriate time rescaling and a specific choice of step-size. However, if the step-size is kept the same between the two algorithms, SGD-M will amplify high-dimensional effects, potentially degrading performance relative to online SGD. We demonstrate our framework on two popular learning problems: Spiked Tensor PCA and Single Index Models. In both cases, we also examine online SGD with an adaptive step-size based on normalized gradients. In the high-dimensional regime, this algorithm yields multiple benefits: its dynamics admit fixed points closer to the population minimum, and it widens the range of admissible step-sizes for which the iterates converge to such solutions. These examples provide a rigorous account, aligning with empirical motivation, of how early preconditioners can stabilize and improve dynamics in settings where online SGD fails.
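The three update rules compared in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code. The function names, the toy streaming linear-regression loss, and all parameter values are hypothetical. The step-size pairing `step * (1 - beta)` for momentum is the familiar heavy-ball heuristic, used here as a stand-in for the paper's "specific choice of step-size", which the abstract does not spell out.

```python
import numpy as np


def sgd_step(x, grad, step):
    """Plain online SGD: move against the single-sample gradient."""
    return x - step * grad


def sgd_momentum_step(x, v, grad, step, beta):
    """Polyak (heavy-ball) momentum, written with a velocity buffer v.

    Equivalent to x_next = x - step * grad + beta * (x - x_prev).
    """
    v_new = beta * v + grad
    return x - step * v_new, v_new


def sgd_unit_step(x, grad, step, eps=1e-12):
    """Normalized-gradient SGD: each update has length (almost exactly) step."""
    return x - step * grad / (np.linalg.norm(grad) + eps)


def run_demo(dim=50, n_steps=2000, step=0.005, beta=0.9, noise=0.1, seed=0):
    """Stream fresh samples y = <x_star, a> + noise through all three rules.

    The toy loss 0.5 * (<x, a> - y)^2 is an assumption chosen for brevity;
    it is not one of the models analysed in the paper.
    """
    rng = np.random.default_rng(seed)
    x_star = rng.standard_normal(dim)
    x_star /= np.linalg.norm(x_star)

    x_sgd = np.zeros(dim)
    x_mom, v = np.zeros(dim), np.zeros(dim)
    x_unit = np.zeros(dim)

    def grad(x, a, y):
        # Gradient of the single-sample squared loss.
        return (a @ x - y) * a

    for _ in range(n_steps):
        a = rng.standard_normal(dim)
        y = a @ x_star + noise * rng.standard_normal()
        x_sgd = sgd_step(x_sgd, grad(x_sgd, a, y), step)
        # Heuristic pairing (assumption): shrink the step by (1 - beta).
        x_mom, v = sgd_momentum_step(
            x_mom, v, grad(x_mom, a, y), step * (1 - beta), beta
        )
        x_unit = sgd_unit_step(x_unit, grad(x_unit, a, y), step)

    return {
        "sgd": float(np.linalg.norm(x_sgd - x_star)),
        "sgd_m": float(np.linalg.norm(x_mom - x_star)),
        "sgd_unit": float(np.linalg.norm(x_unit - x_star)),
    }


if __name__ == "__main__":
    print(run_demo())
```

Passing the unscaled `step` to `sgd_momentum_step` instead of `step * (1 - beta)` gives the "same step-size" comparison that the abstract warns about. In this toy setting the momentum buffer then makes the effective step roughly `1 / (1 - beta)` times larger.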
