An Alternative View: When Does SGD Escape Local Minima?

Robert Kleinberg, Yuanzhi Li, Yang Yuan

TL;DR

The paper reframes SGD as gradient descent on a noise-smoothed version of the loss and proves that if this convolved loss is one-point strongly convex with respect to a target solution x*, SGD will get close to x* and then stay near it with constant probability. It formalizes this one-point convexity after convolution as its main assumption, derives a finite-horizon bound via an Azuma-based argument, and presents empirical evidence that neural network loss surfaces exhibit local one-point convexity around SGD trajectories. The findings offer a principled explanation for SGD's ability to escape sharp local minima and favor flatter regions, and highlight the importance of learning-rate schedules in practice. The work connects theoretical smoothing by gradient noise to observed optimization dynamics in deep networks.

Abstract

Stochastic gradient descent (SGD) is widely used in machine learning. Although commonly viewed as a fast but less accurate version of gradient descent (GD), it always finds better solutions than GD for modern neural networks. In order to understand this phenomenon, we take an alternative view that SGD is working on the convolved (thus smoothed) version of the loss function. We show that, even if the function $f$ has many bad local minima or saddle points, as long as for every point $x$, the weighted average of the gradients of its neighborhoods is one-point convex with respect to the desired solution $x^*$, SGD will get close to, and then stay around, $x^*$ with constant probability. More specifically, SGD will not get stuck at "sharp" local minima with small diameters, as long as the neighborhoods of these regions contain enough gradient information. The neighborhood size is controlled by step size and gradient noise. Our result identifies a set of functions on which SGD provably works, which is much larger than the set of convex functions. Empirically, we observe that the loss surface of neural networks enjoys nice one-point convexity properties locally; therefore our theorem helps explain why SGD works so well for neural networks.
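
The "convolved loss" view in the abstract can be written out as a short derivation. The following is a sketch rather than the paper's exact statement: it assumes the stochastic gradient at $x_t$ has the form $\nabla f(x_t) + \omega_t$ with zero-mean noise $\omega_t \sim W(x_t)$, and it introduces the intermediate point $y_t$ that also appears in the caption of Figure 1.

$$
\begin{aligned}
x_{t+1} &= x_t - \eta\,(\nabla f(x_t) + \omega_t), \qquad y_t := x_t - \eta\,\nabla f(x_t),\\
y_{t+1} &= y_t - \eta\,\omega_t - \eta\,\nabla f(y_t - \eta\,\omega_t),\\
\mathbb{E}_{\omega_t}[y_{t+1}] &= y_t - \eta\,\nabla\, \mathbb{E}_{\omega_t}\big[f(y_t - \eta\,\omega_t)\big].
\end{aligned}
$$

In expectation, the sequence $y_t$ therefore follows gradient descent on $f$ averaged over a neighborhood whose size is set by $\eta$ and the noise. Under the same assumptions, one-point strong convexity of this averaged function with respect to $x^*$, with constant $c$, reads

$$
\big\langle -\nabla\, \mathbb{E}_{\omega \sim W(x)}\big[f(y - \eta\,\omega)\big],\; x^* - y \big\rangle \;\geq\; c\,\|x^* - y\|_2^2 .
$$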

Paper Structure

This paper contains 14 sections, 7 theorems, 27 equations, 6 figures.

Key Result

Theorem 1

Assume $f$ is smooth and that, for every $x\in \mathbb{D}$, the noise distribution $W(x)$ satisfies $\max_{\omega\sim W(x)}\|\omega\|_2 \leq r$. Also assume $\eta$ is bounded by a constant, and that the main assumption holds with $x^*$, $\eta$, and $c$. For $T_1\geq \tilde{O}(\frac{1}{\eta c})$ (where $\tilde{O}$ hides $\log$ terms) ...

Figures (6)

  • Figure 1: SGD path $x_t\rightarrow x_{t+1}$ can be decomposed into $x_t\rightarrow y_t \rightarrow x_{t+1}$. If the local minimum basin has small diameter, the gradient at $x_{t+1}$ will point away from the basin.
  • Figure 2: 3D version of Figure 1: SGD could escape a local minimum within one step.
  • Figure 3: Running SGD on a spiky function $f$. Row 1: $f$ gets smoother after convolving with uniform random noise. Row 2: Run SGD with different noise levels. Every figure is obtained from $100$ trials with different random initializations. Red dots represent the last iterates of these trials, while blue bars represent the cumulative counts. GD without noise easily gets stuck at various local minima, while SGD with an appropriate noise level converges to a local region. Row 3: In order to get closer to $x^*$, one may run SGD in multiple stages with shrinking learning rates.
  • Figure 4: When the step size is too big, even if the gradient is one-point convex, we may still move farther away from $x^*$.
  • Figure 5: (a). The inner product between the negative gradient and $x_{300}-x_t$ for each epoch $t\geq 5$ is always positive. Every data point is the minimum value among $5$ trials. (b). Neighborhood of SGD trajectory is also one point convex with respect to $x_{300}$. (c). Norm of stochastic gradient
  • ...and 1 more figure
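
The experiment described in the Figure 3 caption can be imitated with a small simulation. The sketch below is not the paper's setup: the spiky function `f(x) = x^2/2 + 0.05*cos(K*x)`, the step size `ETA`, the frequency `K`, and the helper names are all assumptions chosen for illustration. The noise radius `R` is picked so that the effective smoothing window `ETA * R` covers half a spike period on each side, which makes the oscillating term average out under uniform noise.

```python
import math
import random

ETA = 0.01               # step size (assumed; below 1/81, so plain GD cannot jump over a barrier)
K = 40.0                 # spike frequency of the hypothetical loss (assumed)
R = math.pi / (K * ETA)  # noise radius: ETA * R equals half a spike period


def grad_f(x):
    """Gradient of the hypothetical spiky loss f(x) = x^2/2 + 0.05*cos(K*x)."""
    return x - 0.05 * K * math.sin(K * x)


def run(x0, noise_radius, steps, rng):
    """Iterate x <- x - ETA * (grad_f(x) + omega), with omega uniform in [-noise_radius, noise_radius]."""
    x = x0
    for _ in range(steps):
        omega = rng.uniform(-noise_radius, noise_radius) if noise_radius > 0 else 0.0
        x -= ETA * (grad_f(x) + omega)
    return x


def mean_final_distance(x0, noise_radius, steps=2000, trials=50, seed=0):
    """Average distance |x_T - x*| over several trials, where x* = 0 is the target of the smoothed loss."""
    rng = random.Random(seed)
    finals = [run(x0, noise_radius, steps, rng) for _ in range(trials)]
    return sum(abs(x) for x in finals) / trials


if __name__ == "__main__":
    print("GD  (no noise):", mean_final_distance(1.5, 0.0))
    print("SGD (uniform noise):", mean_final_distance(1.5, R))
```

With these assumed settings, the noiseless run stays trapped in the local minimum next to its starting point, while the noisy run drifts toward the region around `x = 0` and then fluctuates there. Shrinking `ETA` in later stages, as in Row 3 of Figure 3, is a natural extension but is not attempted in this sketch.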

Theorems & Definitions (11)

  • Theorem 1: Main Theorem, Informal
  • Definition 1: Smoothness
  • Theorem 1: Main Theorem
  • Corollary 2: Shrinking Learning Rate
  • Theorem 3
  • proof
  • Theorem 4: Azuma
  • Lemma 5
  • Lemma 6: Hoeffding bound
  • proof
  • ...and 1 more