Limit Theorems for Stochastic Gradient Descent in High-Dimensional Single-Layer Networks

Parsa Rangriz

TL;DR

This work studies online stochastic gradient descent for high-dimensional, single-layer networks by developing a diffusion-limit framework built on localizability and asymptotic closability. It identifies a critical step-size regime $\delta_N=1/N$ in which a correction term appears, causing the effective dynamics to deviate from the deterministic gradient flow described by DMFT. The authors prove an ODE limit for the summary statistics $u_N=(m,r_\perp^2)$ with a population drift $\mathcal{F}$ and corrector $\mathcal{G}$, and, in a microlocal regime near fixed points, establish a limiting SDE that reduces to an Ornstein–Uhlenbeck process under suitable conditions. For activation functions with information exponent $k>2$, Gaussian initialization leads to $m(t)=0$ and a fixed-point radius $r_\perp^*$, illustrating the key role of stochastic fluctuations in high-dimensional learning and clarifying the limitations of deterministic ballistic scaling in capturing the full dynamics.

Abstract

This paper studies the high-dimensional scaling limits of online stochastic gradient descent (SGD) for single-layer networks. Building on the seminal work of Saad and Solla, which analyzed the deterministic (ballistic) scaling limits of SGD corresponding to the gradient flow of the population loss, we focus on the critical scaling regime of the step size. Below this critical scale, the effective dynamics are governed by ballistic (ODE) limits, but at the critical scale, a new correction term appears that changes the phase diagram. In this regime, near the fixed points, the corresponding diffusive (SDE) limits of the effective dynamics reduce to an Ornstein-Uhlenbeck process under certain conditions. These results highlight how the information exponent controls sample complexity and illustrate the limitations of deterministic scaling limits in capturing the stochastic fluctuations of high-dimensional learning dynamics.
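The setting described above can be made concrete with a small simulation. The following is a minimal sketch, not the paper's code: it assumes a teacher direction $v = e_1$, Gaussian inputs, a `tanh` activation, the quadratic loss $\tfrac12 (f(a\cdot X) - f(a\cdot v))^2$, a Gaussian initialization $X_0 \sim \mathcal{N}(0, I/N)$, and a step size $\delta_N = c/N$. The function name `online_sgd_summary` and all of its parameters are hypothetical. The summary statistics are taken to be $m = X \cdot v$ and $r_\perp^2 = |X|^2 - m^2$.

```python
import numpy as np

def online_sgd_summary(N=200, T=2.0, c=1.0, seed=0):
    """Run online SGD on a single-index model and record (m, r_perp^2).

    Assumptions (illustrative, not taken from the paper):
    teacher v = e_1, activation f = tanh, inputs a ~ N(0, I_N),
    quadratic loss, step size delta_N = c / N, horizon of T * N steps.
    """
    rng = np.random.default_rng(seed)
    f = np.tanh
    df = lambda z: 1.0 - np.tanh(z) ** 2   # derivative of tanh

    v = np.zeros(N)
    v[0] = 1.0                              # unit-norm teacher direction
    X = rng.standard_normal(N) / np.sqrt(N) # Gaussian init with |X|^2 close to 1
    delta = c / N                           # critical step-size scaling
    steps = int(T * N)                      # one fresh sample per step

    m = np.empty(steps + 1)
    r2 = np.empty(steps + 1)
    for k in range(steps + 1):
        m[k] = X @ v                        # overlap with the teacher
        r2[k] = X @ X - m[k] ** 2           # squared norm orthogonal to v
        if k == steps:
            break
        a = rng.standard_normal(N)          # fresh sample (online SGD)
        z = a @ X
        grad = (f(z) - f(a @ v)) * df(z) * a  # gradient of the per-sample loss
        X = X - delta * grad
    return m, r2

if __name__ == "__main__":
    m, r2 = online_sgd_summary()
    print("final m:", m[-1], "final r_perp^2:", r2[-1])
```

Time is measured in units of $N$ steps, so index `k` corresponds to $t = k/N$. Rerunning with a smaller `c` or with larger `N` gives a rough way to compare the recorded trajectories of `m` and `r2` across step-size scalings.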

Paper Structure

This paper contains 7 sections, 6 theorems, 57 equations.

Key Result

Theorem 2.4

Let $(X_k^{\delta_N})_k$ be SGD initialized from $X_0 \sim \mu_N$, for $\mu_N \in \mathcal{M}_1(\mathbb{R}^{N})$, with learning rate $\delta_N$ for the quadratic loss $L$ of a single-index model whose activation function has information exponent at least two. For the corresponding summary stati

Theorems & Definitions (15)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3
  • Theorem 2.4
  • Corollary 2.5
  • Theorem 2.6
  • Corollary 2.7
  • Lemma 3.1
  • proof
  • proof: Proof of Theorem \ref{ode}
  • ...and 5 more