Stochastic Differential Equations models for Least-Squares Stochastic Gradient Descent

Adrien Schertzer; Loucas Pillaud-Vivien

TL;DR

This paper studies SDE models of SGD for the least-squares problem, for both the training loss (finite samples) and the population loss (online setting). In both scenarios, the model yields precise, non-asymptotic rates of convergence to the (possibly degenerate) stationary distribution and describes this asymptotic distribution, with estimates of its mean, of deviations from it, and a proof that heavy tails emerge in relation to the step-size magnitude.

Abstract

We study the dynamics of a continuous-time model of Stochastic Gradient Descent (SGD) for the least-squares problem. Building on the work of Li et al. (2019), we analyze Stochastic Differential Equations (SDEs) that model SGD either on the training loss (finite samples) or on the population loss (online setting). A key qualitative feature of the dynamics is the existence of a perfect interpolator of the data, irrespective of the sample size. In both scenarios, we provide precise, non-asymptotic rates of convergence to the (possibly degenerate) stationary distribution. Additionally, we describe this asymptotic distribution, offering estimates of its mean, deviations from it, and a proof of the emergence of heavy tails related to the step-size magnitude. Numerical simulations supporting our findings are also presented.
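As a rough illustration of the discrete dynamics that these SDEs are meant to model, the sketch below runs single-sample SGD on an empirical least-squares loss in an overparametrized, noiseless setting, where a perfect interpolator exists. It simulates plain SGD, not the SDE itself. The dimensions, step size, data distribution, and the helper name `sgd_least_squares` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def sgd_least_squares(X, y, theta0, step, n_steps, rng):
    """Single-sample SGD on the empirical loss (1/2n) * sum_i (<x_i, theta> - y_i)^2.

    Hypothetical helper for illustration; returns the last iterate and the
    training loss recorded after every step.
    """
    n = X.shape[0]
    theta = theta0.copy()
    losses = np.empty(n_steps)
    for k in range(n_steps):
        i = rng.integers(n)                        # draw one sample uniformly at random
        residual = X[i] @ theta - y[i]
        theta -= step * residual * X[i]            # stochastic gradient step
        losses[k] = 0.5 * np.mean((X @ theta - y) ** 2)
    return theta, losses

rng = np.random.default_rng(0)
n, d = 20, 40                                      # assumed sizes, overparametrized (n < d)
X = rng.standard_normal((n, d)) / np.sqrt(d)       # rows have squared norm close to 1
theta_true = rng.standard_normal(d)
y = X @ theta_true                                 # noiseless labels, so an interpolator exists
theta0 = np.zeros(d)

theta, losses = sgd_least_squares(X, y, theta0, step=0.5, n_steps=20000, rng=rng)
print(f"training loss: first step {losses[0]:.3e}, last step {losses[-1]:.3e}")
```

Because the labels are noiseless, the stochastic gradients vanish at any interpolator, which is why a constant step size can drive the training loss towards zero in this sketch.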

Paper Structure (46 sections, 13 theorems, 159 equations, 3 figures)

This paper contains 46 sections, 13 theorems, 159 equations, 3 figures.

Introduction
Purpose and contributions.
Further related work.
Organisation of the paper.
Set-up: Stochastic Gradient Descent on Least Squares Problems
The least-square problem: population and empirical losses.
Noisy and noiseless settings.
Noisy setting $\mathcal{I} = \emptyset$.
Noiseless setting $\mathcal{I} \neq \emptyset$.
The stochastic gradient descent
The population case: online SGD.
The empirical case: training SGD.
Continuous model of SGD
The requirement of a SDE model
Explicit form of the SDE models
...and 31 more sections

Key Result

Lemma 4.2

The vector $\theta_*$ is the orthogonal projection of $\theta_0$ onto $\mathcal{I}$, that is, $\theta_* = \operatorname{argmin}_{\theta \in \mathcal{I}} \|\theta - \theta_0\|$.
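To make the projection statement concrete, here is a minimal numerical sketch. It assumes that $\mathcal{I}$ is the set of interpolators $\{\theta : X\theta = y\}$ of a finite sample, which this extract does not state explicitly, and it computes the projection with the Moore-Penrose pseudo-inverse. The sizes, seed, and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 12                                   # assumed sizes with n < d
X = rng.standard_normal((n, d))                # full row rank almost surely
y = rng.standard_normal(n)
theta0 = rng.standard_normal(d)

# Orthogonal projection of theta0 onto the affine set {theta : X theta = y}.
X_pinv = np.linalg.pinv(X)
theta_star = theta0 + X_pinv @ (y - X @ theta0)

# theta_star interpolates the data.
interp_err = np.linalg.norm(X @ theta_star - y)

# Build another interpolator by moving along the null space of X.
v = rng.standard_normal(d)
null_dir = v - X_pinv @ (X @ v)                # component of v in the null space of X
other = theta_star + null_dir

dist_star = np.linalg.norm(theta_star - theta0)
dist_other = np.linalg.norm(other - theta0)
print(f"interpolation error: {interp_err:.2e}")
print(f"distance to theta0: projection {dist_star:.4f}, other interpolator {dist_other:.4f}")
```

The displacement `theta_star - theta0` lies in the row space of `X`, so it is orthogonal to every null-space direction, and any other interpolator is farther from `theta0`.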

Figures (3)

Figure 1: Plot showing the error of SGD over time in an overparametrized regime where $n = 100$ and $d = 200$. The samples $(x_i)_{i\leq n}$ come from a Gaussian distribution with a covariance whose eigenvalues decay as a power law. The vertical dotted orange line marks the separation between the two regimes described by Theorem \ref{Th:convergConst}: the polynomial one before (a straight line in a log-log plot) and the exponential one after the typical time scale $1/\mu$. This matches the rates of convergence shown in Theorem \ref{Th:convergConst}.
Figure 2: Two-dimensional projection of $10$ trajectories of SGD for $n = 100$, $d = 200$, in the case where there is a perfect interpolator $\theta_*$. The ellipses represent the level curves of the training loss.
Figure 3: Four plots showing the trajectory of SGD in the noisy setting. The arrow of time goes from top left to bottom right. We see that the two variance reduction methods (time averaging and decaying step-sizes) converge towards $\theta_*$ (confirming Propositions \ref{prop:ergodic} and \ref{prop:stepsizedecay}), while plain SGD has a stationary distribution with fluctuations around its mean $\theta_*$, as explained in Theorem \ref{thm:convergence_noisy} and Proposition \ref{prop:localization}. Plain SGD reaches its invariant distribution faster than the variance reduction methods, as shown by the convergence rates provided in the results.
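The qualitative behaviour described for Figure 3 can be reproduced in a small sketch under stated assumptions: a noisy, underparametrized least-squares problem, plain SGD with a constant step, its running time average, and SGD with a decaying step. The schedule $\gamma / \sqrt{k+1}$, the noise level, the sizes, and the helper name `sgd_with_average` are illustrative choices; the schedules used in the paper are not given in this extract.

```python
import numpy as np

def sgd_with_average(X, y, step_fn, n_steps, rng):
    """Single-sample SGD returning the last iterate and the running average of iterates.

    Hypothetical helper; step_fn(k) gives the step size used at iteration k.
    """
    n, d = X.shape
    theta = np.zeros(d)
    avg = np.zeros(d)
    for k in range(n_steps):
        i = rng.integers(n)
        theta = theta - step_fn(k) * (X[i] @ theta - y[i]) * X[i]
        avg += (theta - avg) / (k + 1)             # running mean of the iterates
    return theta, avg

rng = np.random.default_rng(2)
n, d, n_steps = 200, 5, 20000                      # assumed sizes, underparametrized (n > d)
X = rng.standard_normal((n, d))
y = X @ np.ones(d) + 0.5 * rng.standard_normal(n)  # noisy labels, no interpolator in general
theta_star = np.linalg.lstsq(X, y, rcond=None)[0]  # minimizer of the training loss

gamma = 0.05                                       # assumed base step size
plain, averaged = sgd_with_average(X, y, lambda k: gamma, n_steps, rng)
decayed, _ = sgd_with_average(X, y, lambda k: gamma / np.sqrt(k + 1), n_steps, rng)

errors = {
    "plain SGD (last iterate)": np.linalg.norm(plain - theta_star),
    "time average": np.linalg.norm(averaged - theta_star),
    "decaying step": np.linalg.norm(decayed - theta_star),
}
for name, err in errors.items():
    print(f"{name:>26s}: distance to theta_star = {err:.4f}")
```

In this toy setting one would typically expect the last iterate of plain SGD to keep fluctuating around `theta_star`, while the averaged and decaying-step iterates end up closer to it; the exact numbers depend on the seed and on the assumed parameters.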

Theorems & Definitions (33)

Remark 2.1: Link with RKHS
Example 2.2: Noisy model
Example 2.3: Underparametrized setting
Example 2.4: Overparametrized setting
Definition 4.1
Lemma 4.2
Theorem 4.3
Lemma 4.4
proof
Lemma 4.5
...and 23 more
