On Learning Rates and Schrödinger Operators

Bin Shi, Weijie J. Su, Michael I. Jordan

TL;DR

This paper presents a general theoretical analysis of the effect of the learning rate in stochastic gradient descent (SGD), and provides a mathematical interpretation of the benefits of using learning rate decay for nonconvex optimization.

Abstract

The learning rate is perhaps the single most important parameter in the training of neural networks and, more broadly, in stochastic (nonconvex) optimization. Accordingly, there are numerous effective, but poorly understood, techniques for tuning the learning rate, including learning rate decay, which starts with a large initial learning rate that is gradually decreased. In this paper, we present a general theoretical analysis of the effect of the learning rate in stochastic gradient descent (SGD). Our analysis is based on the use of a learning-rate-dependent stochastic differential equation (lr-dependent SDE) that serves as a surrogate for SGD. For a broad class of objective functions, we establish a linear rate of convergence for this continuous-time formulation of SGD, highlighting the fundamental importance of the learning rate in SGD, and contrasting to gradient descent and stochastic gradient Langevin dynamics. Moreover, we obtain an explicit expression for the optimal linear rate by analyzing the spectrum of the Witten-Laplacian, a special case of the Schrödinger operator associated with the lr-dependent SDE. Strikingly, this expression clearly reveals the dependence of the linear convergence rate on the learning rate -- the linear rate decreases rapidly to zero as the learning rate tends to zero for a broad class of nonconvex functions, whereas it stays constant for strongly convex functions. Based on this sharp distinction between nonconvex and convex problems, we provide a mathematical interpretation of the benefits of using learning rate decay for nonconvex optimization.
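For orientation, the objects named in the abstract can be written out explicitly. The block below is a sketch in notation chosen here for illustration (learning rate $s$, gradient noise $\xi_k$, standard Brownian motion $W$, time $t = ks$); the paper's own statements carry the precise assumptions.

```latex
\begin{align*}
  % SGD: the gradient is perturbed by noise \xi_k and scaled by the learning rate s
  x_{k+1} &= x_k - s \nabla f(x_k) - s\, \xi_k, \\
  % lr-dependent SDE: writing s\,\xi_k = \sqrt{s}\,(\sqrt{s}\,\xi_k) and matching
  % \sqrt{s}\,\xi_k with a Brownian increment over a time step of length s
  dX_s(t) &= -\nabla f(X_s(t))\, dt + \sqrt{s}\, dW(t), \\
  % gradient descent corresponds to the noiseless gradient flow
  \dot{X}(t) &= -\nabla f(X(t)).
\end{align*}
```

In this form the learning rate survives in the continuous-time limit only through the noise amplitude $\sqrt{s}$, which is why the convergence rate of the surrogate can depend on $s$ while that of the gradient flow does not.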

Paper Structure

This paper contains 40 sections, 27 theorems, 171 equations, 12 figures.

Key Result

Lemma 2.2

For any confining function $f$ and any initial probability density $\rho \in L^2(\mu_s^{-1})$, the lr-dependent SDE \ref{eqn: sgd_high_resolution_formally} admits a weak solution whose probability density in $C^{1}\left([0,+\infty), L^{2}(\mu_s^{-1}) \right)$ is the unique solution to the Fokker--Planck--
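The lemma concerns how the probability density of the SDE solution evolves in time. As a sketch, assume the lr-dependent SDE has the form $dX_s = -\nabla f(X_s)\,dt + \sqrt{s}\,dW$ (an assumption stated here, not quoted from the lemma). The standard density evolution equation for such an SDE, its stationary Gibbs density, and the weighted norm behind $L^2(\mu_s^{-1})$ are then:

```latex
\begin{align*}
  % evolution of the density \rho_s(t, x) of X_s(t)
  \frac{\partial \rho_s}{\partial t}
    &= \nabla \cdot \left( \rho_s \nabla f \right) + \frac{s}{2}\, \Delta \rho_s, \\
  % stationary density; the confining condition makes Z_s finite
  \mu_s(x) &= \frac{1}{Z_s} \exp\!\left( -\frac{2 f(x)}{s} \right),
  \qquad Z_s = \int \exp\!\left( -\frac{2 f(x)}{s} \right) dx, \\
  % weighted L^2 norm used for the initial density
  \| \rho \|_{L^2(\mu_s^{-1})}^2 &= \int \rho(x)^2\, \mu_s(x)^{-1}\, dx.
\end{align*}
```

Substituting $\rho_s = \mu_s$ makes the right-hand side of the first equation vanish, since $\frac{s}{2} \nabla \mu_s = -\mu_s \nabla f$.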

Figures (12)

  • Figure 1: Training error using SGD with mini-batch size 32 to train an 8-layer convolutional neural network on CIFAR-10 \cite{krizhevsky2009learning}. The first 90 epochs use a learning rate of $s = 0.006$, the next 120 epochs use $s = 0.003$, and the final 190 epochs use $s = 0.0005$. Note that the training error decreases as the learning rate $s$ decreases and a smaller $s$ leads to a larger number of epochs for SGD to reach a plateau. See \cite{he2016deep} for further investigation of this phenomenon.
  • Figure 2: Illustrative examples showing distinct behaviors of GD, SGD, and SGLD. The $y$-axis displays the optimization error $\overline{f(x_k)} - f(x^\star)$, where $f(x^\star)$ denotes the minimum value of the objective and in the case of SGD and SGLD $\overline{f(x_k)}$ denotes an average over 1000 replications. The objective function is $f(x_1,x_2) = 5 \times 10^{-2}x_1^{2} + 2.5 \times 10^{-2}x_2^{2}$, with an initial point $(8,8)$, and the noise $\xi_k$ in the gradient follows a standard normal distribution. Note that SGD with $s=1$ is identical to SGLD with $s = 1$. As shown in the right panel, taking time $t = ks$ as the $x$-axis, the learning rate has little to no impact on GD and SGLD in terms of optimization error.
  • Figure 3: The dependence of the optimization dynamics of SGD on the learning rate differs between convex objectives and nonconvex objectives. The learning rate is set to either $s = 0.1$ or $s = 0.05$. The two top plots consider minimizing a convex function $f(x_1, x_2) = 5 \times 10^{-2}x_1^{2} + 2.5 \times 10^{-2}x_2^{2}$, with an initial point $(8,8)$, and the bottom plots consider minimizing a nonconvex function $f(x_1,x_2) = [(x_1+0.7)^{2} + 0.1](x_1 - 0.7)^{2} + (x_2 + 0.7)^{2} [(x_2 - 0.7)^{2} + 0.1 ]$, with an initial point $(-0.9, 0.9)$. The gradient noise is drawn from the standard normal distribution. All results are averaged over 10000 independent replications.
  • Figure 4: A one-dimensional nonconvex function $f$. The height difference between $x^\circ$ and $x^\bullet$ in this special case is the Morse saddle barrier $H_f$. See the formal definition in \ref{def:barrier}.
  • Figure 5: Scatter plots of the iterates $x_k \in \mathbb{R}^2$ of SGD for minimizing the nonconvex function in \ref{fig: sgd_nonconvex_traj-gen}. This function has four local minima, of which the bottom right one is the global minimum. Each column corresponds to the same value of $t = ks$, and the first row and second row correspond to learning rates $0.1$ and $0.05$, respectively. The gradient noise is drawn from the standard normal distribution. Each plot is based on 10000 independent SGD runs using the noise generator "state 1-10000" in MATLAB R2019b, starting from an initial point $(-0.9, 0.9)$.
  • ...and 7 more figures
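
The setup described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names (`f`, `grad_f`, `run`), the NumPy random generator, and the seed are assumptions made here, and the update rules are taken to be GD $x_{k+1} = x_k - s\nabla f(x_k)$, SGD with noise scaled by $s$, and SGLD with noise scaled by $\sqrt{s}$, which is consistent with the caption's remark that SGD and SGLD coincide at $s = 1$.

```python
import numpy as np


def f(x):
    """Convex quadratic from the Figure 2 caption; x has shape (..., 2)."""
    return 5e-2 * x[..., 0] ** 2 + 2.5e-2 * x[..., 1] ** 2


def grad_f(x):
    """Gradient of f, computed coordinate-wise."""
    return np.stack([1e-1 * x[..., 0], 5e-2 * x[..., 1]], axis=-1)


def run(method, s, t_end, n_rep=1000, seed=0):
    """Return the replication-averaged error f(x_k) - f(x*) for k = 0..t_end/s.

    method: "gd", "sgd" (noise scaled by s) or "sgld" (noise scaled by sqrt(s)).
    The minimum value f(x*) is 0, so the error is simply the mean of f.
    """
    scales = {"gd": 0.0, "sgd": s, "sgld": np.sqrt(s)}
    if method not in scales:
        raise ValueError(f"unknown method: {method}")
    rng = np.random.default_rng(seed)  # assumed generator, not the paper's
    n_steps = int(round(t_end / s))
    x = np.tile(np.array([8.0, 8.0]), (n_rep, 1))  # initial point (8, 8)
    errors = np.empty(n_steps + 1)
    errors[0] = f(x).mean()
    for k in range(n_steps):
        xi = rng.standard_normal(x.shape)  # standard normal gradient noise
        # The sign of the noise term is immaterial because xi is symmetric.
        x = x - s * grad_f(x) + scales[method] * xi
        errors[k + 1] = f(x).mean()
    return errors


if __name__ == "__main__":
    for s in (1.0, 0.1):
        for method in ("gd", "sgd", "sgld"):
            err = run(method, s, t_end=100.0)
            print(f"{method:5s} s={s:<4} error at t=100: {err[-1]:.4f}")
```

Plotting each returned array against $t = ks$ rather than against $k$ puts runs with different learning rates on a common time axis, which is the comparison the right panel of Figure 2 makes.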

Theorems & Definitions (45)

  • Definition 2.1: Confining condition \cite{pavliotis2014stochastic, markowich1999trend}
  • Lemma 2.2: Existence and uniqueness of the weak solution
  • Definition 2.3: Villani condition \cite{villani2009hypocoercivity}
  • Theorem 1
  • Proposition 3.1
  • Proposition 3.2
  • Corollary 3.3
  • Proposition 3.4
  • Theorem 2
  • Proposition 3.5
  • ...and 35 more