Functional Central Limit Theorem for Stochastic Gradient Descent

Kessang Flamand, Victor-Emmanuel Brunel

TL;DR

This work studies the full asymptotic behavior of SGD trajectories for convex objectives, not merely the last iterate. It proves a functional central limit theorem for a rescaled SGD path with step size $t_n = \delta/n$: the rescaled path converges to a diffusion process $\{Y_t\}$ on $(0,\infty)$ solving the SDE $dY_t = -t^{-1}H Y_t\,dt + \Sigma^{1/2}\,dB_t$, where $H = \delta\nabla^2\Phi(\theta^*) - I_d$ and $\Sigma = \delta^2\Gamma$. This yields Gaussian fluctuations of the trajectory and, in particular, $\sqrt{n}(\hat{\theta}_n - \theta^*) \Rightarrow N(0,\Sigma)$, giving a trajectory-level counterpart to classical CLTs that also applies to non-smooth robust objectives such as the geometric median. The result gives a diffusion-based picture of long-term SGD behavior under mild convexity assumptions. Stated limitations are that the local curvature must be known to set $\delta$ and that the variance of standard SGD may be larger than ERM benchmarks; future work includes extending the result to averaging schemes (e.g., Polyak-Ruppert) for asymptotic efficiency.
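
To get a feel for the limiting SDE stated above, one option is to discretize it numerically. The following is a minimal sketch using the Euler-Maruyama scheme, which is a generic numerical method and not something taken from the paper. The function name `simulate_limit_diffusion`, the start time `t0`, the initial state `y0`, and the grid size `dt` are all hypothetical choices: because the drift is singular at $t=0$, the sketch starts at some $t_0 > 0$ from a user-supplied state. The argument `sigma_sqrt` stands for any matrix square root of $\Sigma$.

```python
import numpy as np

def simulate_limit_diffusion(H, sigma_sqrt, y0, t0=0.1, T=1.0, dt=1e-3, seed=0):
    """Euler-Maruyama sketch for dY_t = -(1/t) H Y_t dt + sigma_sqrt dB_t on [t0, T].

    t0, y0 and dt are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    H = np.atleast_2d(np.asarray(H, dtype=float))
    sigma_sqrt = np.atleast_2d(np.asarray(sigma_sqrt, dtype=float))
    y = np.atleast_1d(np.asarray(y0, dtype=float)).copy()

    n_steps = int(round((T - t0) / dt))
    times = t0 + dt * np.arange(n_steps + 1)
    path = np.empty((n_steps + 1, y.size))
    path[0] = y

    for k in range(n_steps):
        # Brownian increment over [t_k, t_k + dt] has covariance dt * I.
        noise = rng.standard_normal(y.size)
        # Drift is evaluated at the left endpoint t_k (explicit scheme).
        y = y - (dt / times[k]) * (H @ y) + np.sqrt(dt) * (sigma_sqrt @ noise)
        path[k + 1] = y

    return times, path
```

With the noise switched off (`sigma_sqrt` equal to zero) and scalar $H = h$, the equation reduces to the ODE $y' = -h\,y/t$, whose solution is $y(t) = y(t_0)\,(t_0/t)^h$; comparing the simulated path against this closed form is a simple sanity check of the discretization.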

Abstract

We study the asymptotic shape of the trajectory of the stochastic gradient descent algorithm applied to a convex objective function. Under mild regularity assumptions, we prove a functional central limit theorem for the properly rescaled trajectory. Our result characterizes the long-term fluctuations of the algorithm around the minimizer by providing a diffusion limit for the trajectory. In contrast with classical central limit theorems for the last iterate or Polyak-Ruppert averages, this functional result captures the temporal structure of the fluctuations and applies to non-smooth settings such as robust location estimation, including the geometric median.

Paper Structure (14 sections, 16 theorems, 71 equations, 1 figure)

Key Result

Theorem 1

Let the sequence of step-sizes $(t_n)_{n\geq 1}$ satisfy $\sum_{n\geq 1}t_n=\infty$ and $\sum_{n\geq 1}t_n^2<\infty$. Let the assumptions labeled assump:Phi-a, assump:quadratic_growth and assump:noise1 in the paper hold. Then $\theta_n\xrightarrow[n\to\infty]{} \theta^*$ almost surely.
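
As a quick check of the two step-size conditions, consider the schedule $t_n = \delta/n$ with $\delta > 0$ used in the TL;DR. The harmonic series diverges while the series of inverse squares converges, so both requirements hold:

$$\sum_{n\geq 1} t_n = \delta \sum_{n\geq 1} \frac{1}{n} = \infty, \qquad \sum_{n\geq 1} t_n^2 = \delta^2 \sum_{n\geq 1} \frac{1}{n^2} = \frac{\delta^2 \pi^2}{6} < \infty.$$

More generally, a polynomial schedule $t_n = \delta n^{-\alpha}$ satisfies both conditions exactly when $1/2 < \alpha \leq 1$.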

Figures (1)

  • Figure 1: Stochastic gradient descent trajectory for the estimation of the median of a Laplace$(0,1)$ distribution in $\mathbb{R}^2$ with independent coordinates, based on $n=50000$ samples. The first panel shows the full trajectory. The second panel zooms in on the fluctuations around the minimizer, with the first $2000$ iterations removed. The third panel shows the rescaled trajectory, illustrating the diffusion-like behavior predicted by the main theorem (thm:main). Color indicates time, from light (start) to dark (end). The step size at iteration $k$ is $2/k$.
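
The experiment described in the Figure 1 caption can be approximated with a few lines of NumPy. This is a rough sketch, not the authors' code. It assumes the objective is the geometric median, $\Phi(\theta) = \mathbb{E}\|X - \theta\|$, so the stochastic subgradient at $\theta$ is $(\theta - X)/\|\theta - X\|$. The function name `sgd_geometric_median`, the starting point, and the random seed are hypothetical; the sample size, the Laplace$(0,1)$ coordinates, and the step size $2/k$ follow the caption.

```python
import numpy as np

def sgd_geometric_median(n=50_000, delta=2.0, dim=2, seed=0):
    """SGD with step size delta/k on Phi(theta) = E||X - theta|| (assumed objective).

    X has independent Laplace(0, 1) coordinates, so the minimizer is the origin.
    Returns the trajectory as an array of shape (n + 1, dim).
    """
    rng = np.random.default_rng(seed)
    samples = rng.laplace(loc=0.0, scale=1.0, size=(n, dim))

    theta = np.ones(dim)  # hypothetical starting point, not from the paper
    trajectory = np.empty((n + 1, dim))
    trajectory[0] = theta

    for k in range(1, n + 1):
        diff = theta - samples[k - 1]
        norm = np.linalg.norm(diff)
        # Subgradient of theta -> ||x - theta||; take 0 at the non-smooth point.
        grad = diff / norm if norm > 0 else np.zeros(dim)
        theta = theta - (delta / k) * grad
        trajectory[k] = theta

    return trajectory
```

Slicing the result as `trajectory[2000:]` drops the first 2000 iterations, which corresponds to the zoomed-in view described for the second panel.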

Theorems & Definitions (28)

  • Theorem 1
  • proof
  • Theorem 2
  • Proposition 1
  • Theorem 3
  • Remark 1
  • Proposition 2
  • proof
  • proof: Proof of the main theorem (thm:main)
  • Lemma 1
  • ...and 18 more