Statistical Guarantees for High-Dimensional Stochastic Gradient Descent

Jiaqi Li, Zhipeng Lou, Johannes Schmidt-Hieber, Wei Biao Wu

TL;DR

This work develops a rigorous theory for constant-step stochastic gradient descent in high dimensions by recasting SGD as a high-dimensional nonlinear time series. It establishes geometric-moment contraction to a stationary distribution, which yields non-asymptotic $q$-th moment bounds in general $\ell^s$ norms and sharp high-probability concentration via a Fuk–Nagaev-type inequality. The authors also analyze Ruppert–Polyak averaged SGD (ASGD), deriving a detailed $\ell^{\infty}$-norm bound, an explicit complexity result, and a Gaussian approximation for the stationary ASGD, with dimension-dependent learning-rate guidance. Collectively, these results close a key gap in SGD theory for constant learning rates in large-scale, overparameterized models and provide a versatile toolkit for analyzing a broad class of high-dimensional online learning algorithms.

Abstract

Stochastic Gradient Descent (SGD) and its Ruppert–Polyak averaged variant (ASGD) lie at the heart of modern large-scale learning, yet their theoretical properties in high-dimensional settings remain poorly understood. In this paper, we provide rigorous statistical guarantees for constant learning-rate SGD and ASGD in high-dimensional regimes. Our key innovation is to transfer powerful tools from high-dimensional time series to online learning. Specifically, by viewing SGD as a nonlinear autoregressive process and adapting existing coupling techniques, we prove the geometric-moment contraction of high-dimensional SGD for constant learning rates, thereby establishing asymptotic stationarity of the iterates. Building on this, we derive the $q$-th moment convergence of SGD and ASGD for any $q\ge2$ in general $\ell^s$-norms and, in particular, in the $\ell^{\infty}$-norm that is frequently adopted in high-dimensional sparse or structured models. Furthermore, we provide a sharp high-probability concentration analysis, which yields probabilistic bounds for high-dimensional ASGD. Beyond closing a critical gap in SGD theory, our proposed framework offers a novel toolkit for analyzing a broad class of high-dimensional learning algorithms.
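
To make the "nonlinear autoregressive process" view concrete, the constant-step SGD recursion and its Ruppert–Polyak average can be written in generic notation as follows. The stochastic loss $g$ and the symbol $\bar{\boldsymbol\beta}_k$ are assumptions introduced here for illustration and may differ from the paper's own notation; $\alpha$ is the constant learning rate and $\{\boldsymbol\xi_k\}_{k\ge1}$ the i.i.d. noise.

$$
\boldsymbol\beta_k(\alpha) = \boldsymbol\beta_{k-1}(\alpha) - \alpha\,\nabla g\big(\boldsymbol\beta_{k-1}(\alpha), \boldsymbol\xi_k\big), \quad k\ge1,
\qquad
\bar{\boldsymbol\beta}_k(\alpha) = \frac{1}{k}\sum_{i=1}^{k}\boldsymbol\beta_i(\alpha).
$$

Each iterate is a fixed (generally nonlinear) function of the previous iterate and a fresh noise term, which is the structure of a first-order autoregressive process and is what allows time-series coupling arguments to be applied.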

Paper Structure

This paper contains 24 sections, 22 theorems, 180 equations, 1 table.

Key Result

Theorem 1

Suppose that Assumptions asm_coercive–asm_Ls_lip hold for some $\mu>0$, $q\ge2$ and even integer $s\ge2$. Given a constant learning rate $\alpha$, for any two $d$-dimensional SGD sequences $\{\boldsymbol\beta_k(\alpha)\}_{k\in\mathbb N}$ and $\{\boldsymbol\beta_k'(\alpha)\}_{k\in\mathbb N}$ sharing the same i.i.d. noise injections $\{\boldsymbol\xi_k\}_{k\ge1}$ but possibly different initializations …
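
The coupling construction in this statement (two SGD sequences driven by the same noise but started from different points) can be illustrated numerically. The following is a minimal sketch on a toy least-squares problem, not the paper's setting or proof: the function name `coupled_sgd_gap`, the Gaussian design, the target `beta_star`, and all default parameter values are assumptions chosen for illustration.

```python
import numpy as np


def coupled_sgd_gap(d=20, alpha=0.01, n_steps=2000, seed=0):
    """Run two constant-step SGD chains on a toy least-squares problem.

    Both chains see the same random samples at every step and differ only
    in their starting point. Returns the l-infinity distance between the
    two iterates after each step (index 0 is the initial distance).
    """
    rng = np.random.default_rng(seed)
    beta_star = np.ones(d) / np.sqrt(d)   # hypothetical target parameter
    beta = 5.0 * rng.normal(size=d)       # initialization of the first chain
    beta_prime = -beta                    # a different initialization

    gaps = np.empty(n_steps + 1)
    gaps[0] = np.max(np.abs(beta - beta_prime))
    for k in range(1, n_steps + 1):
        # One shared sample (x, y) plays the role of the noise xi_k.
        x = rng.normal(size=d)
        y = x @ beta_star + rng.normal()
        # Stochastic gradient of 0.5 * (x @ b - y)**2 with respect to b.
        beta = beta - alpha * (x @ beta - y) * x
        beta_prime = beta_prime - alpha * (x @ beta_prime - y) * x
        gaps[k] = np.max(np.abs(beta - beta_prime))
    return gaps


if __name__ == "__main__":
    gaps = coupled_sgd_gap()
    for k in (0, 500, 1000, 2000):
        print(f"step {k:4d}: l-infinity gap = {gaps[k]:.3e}")
```

In this toy model the difference of the two chains evolves as $\boldsymbol\delta_k = (I - \alpha\,\boldsymbol x_k\boldsymbol x_k^\top)\boldsymbol\delta_{k-1}$, so for standard Gaussian $\boldsymbol x_k$ the expected squared $\ell^2$ gap shrinks by the factor $1 - 2\alpha + \alpha^2(d+2)$ per step. That factor is below one only when $\alpha < 2/(d+2)$, a simple instance of why the admissible learning rate depends on the dimension; the paper's actual conditions are not reproduced here.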

Theorems & Definitions (39)

  • Theorem 1: Convergence of SGD to stationary distribution
  • Proposition 1
  • Theorem 2: Moment convergence of SGD
  • Theorem 3
  • Proposition 2: Complexity bound
  • Theorem 4: Fuk-Nagaev inequality
  • Theorem 5: Gaussian approximation
  • Lemma 1
  • Lemma 2: High-dimensional moment inequality
  • Theorem 6: Asymptotic stationarity
  • ...and 29 more