Exponential convergence rates for momentum stochastic gradient descent in the overparametrized setting

Benjamin Gess; Sebastian Kassing

Abstract

We prove explicit bounds on the exponential rate of convergence for the momentum stochastic gradient descent scheme (MSGD) for arbitrary, fixed hyperparameters (learning rate, friction parameter) and its continuous-in-time counterpart in the context of non-convex optimization. In the small step-size regime and in the case of flat minima or large noise intensities, these bounds prove faster convergence of MSGD compared to plain stochastic gradient descent (SGD). The results are shown for objective functions satisfying a local Polyak-Lojasiewicz inequality and under assumptions on the variance of MSGD that are satisfied in overparametrized settings. Moreover, we analyze the optimal choice of the friction parameter and show that the MSGD process almost surely converges to a local minimum.
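To make the setting concrete, the following is a minimal sketch of a momentum SGD iteration with fixed learning rate $\gamma$ and friction parameter $\mu$ on a toy overparametrized least-squares problem. The update rule used here ($V_{n+1} = V_n - \gamma(\mu V_n + g_n)$, $X_{n+1} = X_n + \gamma V_{n+1}$, with $g_n$ a single-sample gradient) is an assumed heavy-ball style discretization chosen for illustration and is not taken from the paper's definitions. The function names (`make_problem`, `loss`, `msgd`) and all parameter values are hypothetical as well. Because there are more parameters than samples, the loss can reach zero and the gradient noise vanishes at global minima, which is the kind of variance behavior the abstract associates with overparametrized settings.

```python
import math
import random


def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def make_problem(n_samples, dim, rng):
    """Toy interpolation problem with dim > n_samples (illustrative assumption).

    Rows are normalized to unit length; targets come from a hidden x_true,
    so a parameter vector with zero loss exists.
    """
    rows = []
    for _ in range(n_samples):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(dot(a, a))
        rows.append([c / norm for c in a])
    x_true = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    targets = [dot(a, x_true) for a in rows]
    return rows, targets


def loss(rows, targets, x):
    """Empirical risk f(x) = 1/(2n) * sum_i (a_i . x - b_i)^2."""
    return sum((dot(a, x) - t) ** 2 for a, t in zip(rows, targets)) / (2 * len(rows))


def msgd(rows, targets, gamma, mu, steps, rng):
    """Assumed momentum SGD sketch with fixed gamma (step) and mu (friction)."""
    dim = len(rows[0])
    x = [0.0] * dim  # position X_0
    v = [0.0] * dim  # velocity V_0
    history = [loss(rows, targets, x)]
    for _ in range(steps):
        i = rng.randrange(len(rows))  # uniformly sampled data point
        a = rows[i]
        residual = dot(a, x) - targets[i]
        # Single-sample gradient is residual * a_i; it is zero at any interpolating x.
        v = [vj - gamma * (mu * vj + residual * aj) for vj, aj in zip(v, a)]
        x = [xj + gamma * vj for xj, vj in zip(x, v)]
        history.append(loss(rows, targets, x))
    return history


if __name__ == "__main__":
    rng = random.Random(0)
    rows, targets = make_problem(n_samples=5, dim=20, rng=rng)
    history = msgd(rows, targets, gamma=0.05, mu=1.0, steps=6000, rng=rng)
    print(f"initial loss {history[0]:.3e}, final loss {history[-1]:.3e}")
```

Plotting `history` on a logarithmic scale is one way to eyeball whether the decay looks roughly linear there, which is what an exponential rate would suggest. Varying `mu` for fixed `gamma` gives a rough feel for how the friction parameter affects that slope in this toy problem.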

Paper Structure (6 sections, 18 theorems, 153 equations, 4 figures)

Introduction
Loss landscape and noise in empirical risk minimization
Momentum stochastic gradient descent in discrete time
Lyapunov estimates
The small step-size case
Momentum stochastic gradient descent in continuous time

Key Result

Theorem 1.1

(See Theorem theo1 and Theorem rem:constrained1.) Let $\gamma_n \equiv \gamma>0$, $L>0$ and $\sigma \ge 0$. Let $\mathcal{D}\subset {\mathbb R}^d$ be an open set and assume that for all $x \in \mathcal{D}$ [...]. Moreover, for $n \in {\mathbb N}_0$, let ${\mathbb A}_n = \{X_i \in \mathcal{D} \text{ for all } i =0, \dots, n\}$ and assume that [...]. If there exist parameters $a,b \ge 0$ such that all of the [...]

Figures (4)

Figure 1: Comparison of the convergence rate $r_{\text{MSGD}}$ for MSGD and the convergence rate $r_{\text{SGD}}$ for SGD in the sense of \eqref{eq:rate} for fixed $\gamma=0.01$ and $\sigma=0$, different values of $L$ ($y$-axis) and $\kappa = \frac{C_L}{L}$ ($x$-axis), and optimally chosen friction parameter $\mu^*$. Blue indicates that MSGD outperforms SGD; red indicates that SGD outperforms MSGD.
Figure 2: Comparison of the convergence rate $r_{\text{MSGD}}$ for MSGD and the convergence rate $r_{\text{SGD}}$ for SGD in the sense of \eqref{eq:rate} for fixed $\gamma=0.01$ and $\sigma=100$, different values of $L$ ($y$-axis) and $\kappa = \frac{C_L}{L}$ ($x$-axis), and optimally chosen friction parameter $\mu^*$. For $\kappa \ge 4$ one has $r_{\text{SGD}}<1$ so that SGD does not converge.
Figure 3: Comparison of the convergence rate $r_{\text{MSGD}}$ for MSGD and the convergence rate $r_{\text{SGD}}$ for SGD in the sense of \eqref{eq:rate} for fixed $L=\frac{1}{50}$, $C_L=\frac{3}{50}$ and different values of $\gamma$ ($y$-axis) and $\mu$ ($x$-axis). The figure shows the value $(r_{\text{MSGD}}-r_{\text{SGD}})/\gamma$.
Figure 4: Comparison of the convergence rate $m$ for MSGD (blue) and SGD (orange) in continuous time in the sense of Theorem \ref{theoSDE1} (i), depending on the noise intensity $\sigma$ ($x$-axis), for different values of $L$ and $C_L$.

Theorems & Definitions (40)

Theorem 1.1
Theorem 1.2
Lemma 2.1
proof
Remark 2.2
Theorem 3.1
Theorem 3.2
Corollary 3.3
Theorem 3.4
Remark 3.5
...and 30 more
