Gradient Descent Can Take Exponential Time to Escape Saddle Points

Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Barnabas Poczos, Aarti Singh

TL;DR

The paper investigates gradient descent (GD) on non-convex functions and shows that, despite prior asymptotic guarantees for escaping saddle points, vanilla GD can require exponential time to escape even under natural random initializations. It contrasts this with perturbed gradient descent (PGD), which escapes in polynomial time, thereby justifying perturbations for efficient non-convex optimization. The authors construct a smooth, high-dimensional counterexample with a tube/octopus geometry and use spline connections and Whitney extension to prove the exponential-time behavior of GD, while PGD escapes quickly; they also provide experiments validating the theory. The results highlight a fundamental speed difference between GD and its perturbed variant and point to practical implications for optimization in non-convex settings and possible extensions to stochastic algorithms.

Abstract

Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape. On the other hand, gradient descent with perturbations [Ge et al., 2015; Jin et al., 2017] is not slowed down by saddle points; it can find an approximate local minimizer in polynomial time. This result implies that GD is inherently slower than perturbed GD, and justifies the importance of adding perturbations for efficient non-convex optimization. While our focus is theoretical, we also present experiments that illustrate our theoretical findings.
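The mechanism behind this contrast can be seen on a toy problem. The sketch below is not the paper's counterexample and not the exact algorithm of Jin et al.; it assumes a simple quadratic saddle $f(x, y) = \tfrac{1}{2}x^2 - \tfrac{1}{2}y^2$, and the function name `steps_to_escape`, the gradient threshold, and the noise scale are hypothetical choices made for illustration. Plain GD started on the stable manifold ($y = 0$) never leaves it, whereas adding a small random perturbation whenever the gradient is small lets the iterate pick up the escaping direction.

```python
import numpy as np


def grad(p):
    # Gradient of the assumed toy function f(x, y) = 0.5*x**2 - 0.5*y**2,
    # which has a strict saddle at the origin.
    x, y = p
    return np.array([x, -y])


def steps_to_escape(p0, eta=0.1, radius=1.0, max_steps=10_000,
                    perturb=False, g_thresh=1e-3, noise=1e-3, seed=0):
    """Count iterations until |y| >= radius; return max_steps if never reached.

    With perturb=True, uniform noise is added whenever the gradient norm is
    at most g_thresh. This is a simplified stand-in for perturbed GD, not
    the exact procedure from the cited papers.
    """
    rng = np.random.default_rng(seed)
    p = np.array(p0, dtype=float)
    for t in range(max_steps):
        if abs(p[1]) >= radius:
            return t
        g = grad(p)
        if perturb and np.linalg.norm(g) <= g_thresh:
            # Small random kick near the stationary point.
            p = p + noise * rng.uniform(-1.0, 1.0, size=2)
            g = grad(p)
        p = p - eta * g
    return max_steps


# Start exactly on the stable manifold of the saddle (y = 0).
gd_steps = steps_to_escape([1.0, 0.0], perturb=False)
pgd_steps = steps_to_escape([1.0, 0.0], perturb=True)
print("GD steps:", gd_steps)    # y stays exactly 0, so GD hits max_steps
print("PGD steps:", pgd_steps)  # finite, far below max_steps
```

In this toy the slow set for GD has measure zero, which is consistent with the asymptotic guarantee of Lee et al.; the paper's contribution is a construction in which the slowdown persists under natural random initialization.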

Paper Structure

This paper contains 27 sections, 11 theorems, 68 equations, 4 figures, 1 algorithm.

Key Result

Theorem 2.6

Suppose that $f$ is $\ell$-gradient Lipschitz and has a continuous Hessian, and that the step size satisfies $\eta < \frac{1}{\ell}$. Furthermore, assume that gradient descent converges, meaning $\lim_{t \to \infty} \mathbf{x}^{(t)}$ exists, and the initialization distribution $\nu$ is absolutely continuous with respect to

Figures (4)

  • Figure 1: If the initialization point lies in the red rectangle, GD takes a long time to escape the neighborhood of the saddle point $(0,0)$.
  • Figure 2: Graphical illustrations of our counter-example with $\tau = e$. The blue points are saddle points and the red point is the minimum. The pink line is the trajectory of gradient descent.
  • Figure 3: Performance of GD and PGD on our counter-example with $d=5$.
  • Figure 5: Illustration of intersection surfaces used in our construction.
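The slow region described in the Figure 1 caption can be quantified with a short calculation. The sketch below assumes the quadratic saddle $f(x_1, x_2) = x_1^2 - x_2^2$ (which is $2$-gradient Lipschitz), a step size $\eta < \frac{1}{2}$, an escape radius $r > 0$, and a starting point with $0 < |x_2^{(0)}| < r$; the function and the radius are illustrative assumptions, not details taken from the caption.

```latex
\begin{align*}
x_1^{(t+1)} &= (1 - 2\eta)\, x_1^{(t)}, &
x_2^{(t+1)} &= (1 + 2\eta)\, x_2^{(t)}, \\
|x_2^{(t)}| \ge r
&\iff
t \ge \frac{\log\bigl(r / |x_2^{(0)}|\bigr)}{\log(1 + 2\eta)}.
\end{align*}
```

Under these assumptions the escape time depends only on how close the starting point is to the $x_1$-axis: it grows like $\log\bigl(1/|x_2^{(0)}|\bigr)$, so a thin band around the axis is exactly the set of slow initializations.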

Theorems & Definitions (20)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3
  • Definition 2.4
  • Definition 2.5
  • Theorem 2.6 [Lee et al., 2016]
  • Theorem 2.7 [Jin et al., 2017]
  • Theorem 4.1: Uniform initialization over a unit cube
  • Corollary 4.2
  • Corollary 4.3
  • ...and 10 more