Stochastic Differential Equations models for Least-Squares Stochastic Gradient Descent

Adrien Schertzer, Loucas Pillaud-Vivien

TL;DR

For both the training-loss (finite-sample) and population (online) settings, an SDE model of least-squares SGD provides precise, non-asymptotic rates of convergence to the (possibly degenerate) stationary distribution. It also describes this asymptotic distribution, offering estimates of its mean, deviations from it, and a proof of the emergence of heavy tails related to the step-size magnitude.

Abstract

We study the dynamics of a continuous-time model of Stochastic Gradient Descent (SGD) for the least-squares problem. Building on the work of Li et al. (2019), we analyze Stochastic Differential Equations (SDEs) that model SGD either in the case of the training loss (finite samples) or the population one (online setting). A key qualitative feature of the dynamics is the existence of a perfect interpolator of the data, irrespective of the sample size. In both scenarios, we provide precise, non-asymptotic rates of convergence to the (possibly degenerate) stationary distribution. Additionally, we describe this asymptotic distribution, offering estimates of its mean, deviations from it, and a proof of the emergence of heavy tails related to the step-size magnitude. Numerical simulations supporting our findings are also presented.

Paper Structure (46 sections, 13 theorems, 159 equations, 3 figures)

Key Result

Lemma 4.2

The vector $\theta_*$ is the orthogonal projection of $\theta_0$ onto $\mathcal{I}$.
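As a rough illustration of this projection, assume that $\mathcal{I}$ denotes the set of interpolators $\{\theta : X\theta = y\}$ (this excerpt does not define $\mathcal{I}$, so that reading is an assumption). For a non-empty affine set of this form, the orthogonal projection of $\theta_0$ has the standard closed form $\theta_0 + X^{+}(y - X\theta_0)$, where $X^{+}$ is the Moore-Penrose pseudoinverse. The sketch below computes it with NumPy; the function name, the dimensions, and the random data are hypothetical and chosen only for the example.

```python
import numpy as np


def project_onto_interpolators(theta0, X, y):
    """Orthogonal projection of theta0 onto {theta : X @ theta = y}.

    Assumes the set is non-empty (e.g. X has full row rank).
    """
    residual = y - X @ theta0
    # pinv(X) @ residual is the minimum-norm correction, so it lies in the
    # row space of X and is orthogonal to the direction space of the set.
    return theta0 + np.linalg.pinv(X) @ residual


# Hypothetical overparametrized toy data (n < d), not taken from the paper.
rng = np.random.default_rng(0)
n, d = 5, 12
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
theta0 = rng.standard_normal(d)

theta_star = project_onto_interpolators(theta0, X, y)
print("interpolation residual:", np.linalg.norm(X @ theta_star - y))
```

The displacement `theta_star - theta0` stays in the row space of `X`, which is what makes the projection orthogonal rather than an arbitrary point of the interpolator set.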

Figures (3)

  • Figure 1: Error of SGD over time in an overparametrized regime with $n = 100$ and $d = 200$. The samples $(x_i)_{i\leq n}$ are drawn from a Gaussian distribution whose covariance has eigenvalues decaying as a power law. The vertical dotted orange line separates the two regimes described by Theorem \ref{Th:convergConst}: polynomial convergence before it (a straight line in a log-log plot) and exponential convergence after the typical time scale $1/\mu$. The plot matches the rates of convergence shown in Theorem \ref{Th:convergConst}.
  • Figure 2: Two-dimensional projection of $10$ trajectories of SGD for $n = 100$, $d = 200$, in the case where there is a perfect interpolator $\theta_*$. The ellipses represent the level curves of the training loss.
  • Figure 3: Four plots showing the trajectory of SGD in the noisy setting. Time runs from top left to bottom right. The two variance reduction methods (time averaging and decaying step-sizes) converge towards $\theta_*$ (confirming Propositions \ref{prop:ergodic} and \ref{prop:stepsizedecay}), while plain SGD has a stationary distribution that fluctuates around its mean $\theta_*$, as explained in Theorem \ref{thm:convergence_noisy} and Proposition \ref{prop:localization}. Plain SGD reaches its invariant distribution faster than the variance reduction methods converge, as shown by the convergence rates provided in the results.
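The noiseless, overparametrized setting of Figures 1 and 2 can be loosely reproduced with a few lines of NumPy. The sketch below is not the authors' experiment: the power-law exponent, the planted parameter used to generate labels, the step size `gamma`, the number of steps, and the single-sample update are all assumptions made for illustration. It runs discrete SGD on the least-squares training loss and tracks the distance to `theta_star`, taken here to be the orthogonal projection of the initial point onto the set of interpolators.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 200  # sizes quoted in the figure captions

# Gaussian samples with a power-law covariance spectrum (exponent assumed).
eigs = 1.0 / np.arange(1, d + 1)
X = rng.standard_normal((n, d)) * np.sqrt(eigs)

# Noiseless labels from a hypothetical planted parameter, so that a perfect
# interpolator exists.
theta_planted = rng.standard_normal(d)
y = X @ theta_planted

theta0 = np.zeros(d)
# Orthogonal projection of theta0 onto {theta : X @ theta = y}.
theta_star = theta0 + np.linalg.pinv(X) @ (y - X @ theta0)

# Assumed step size: small enough that gamma * ||x_i||^2 <= 0.5 for every
# sample, which makes each update non-expansive with respect to theta_star.
gamma = 0.5 / np.max(np.sum(X**2, axis=1))

n_steps = 20_000
theta = theta0.copy()
errors = []
for t in range(n_steps):
    if t % 200 == 0:
        errors.append(np.linalg.norm(theta - theta_star))
    i = rng.integers(n)  # sample one data point uniformly
    theta -= gamma * (X[i] @ theta - y[i]) * X[i]
errors.append(np.linalg.norm(theta - theta_star))

print("initial error:", errors[0])
print("final error:  ", errors[-1])
```

Plotting `errors` on log-log axes is one way to look for the polynomial-then-exponential shape described in the Figure 1 caption, although the exact curve depends on the assumed spectrum and step size.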

Theorems & Definitions (33)

  • Remark 2.1: Link with RKHS
  • Example 2.2: Noisy model
  • Example 2.3: Underparametrized setting
  • Example 2.4: Overparametrized setting
  • Definition 4.1
  • Lemma 4.2
  • Theorem 4.3
  • Lemma 4.4
  • proof
  • Lemma 4.5
  • ...and 23 more