The Power of Normalization: Faster Evasion of Saddle Points

Kfir Y. Levy

TL;DR

The paper addresses the challenge of saddle points in non-convex optimization by introducing Saddle-NGD, a normalized gradient descent variant with periodic noise to promote saddle-escape. It proves convergence to a local minimum with a rate of $\tilde{O}(\eta^{-3/2})$ iterations and a function-value gap of $\tilde{O}(\eta)$, with an emphasis on the dimension-dependent max learning rate $\eta_{\max}=\tilde{O}(1/d^2)$, improving over prior noisy-GD results. The stochastic extension shows comparable sample complexity to noisy-GD, while empirical results on online tensor decomposition and ICA demonstrate practical benefits in environments with saddle points and multi-modality. Altogether, the work provides a theoretically grounded, scalable first-order method for rapidly escaping saddles and converging to local minima in high-dimensional non-convex problems.

Abstract

A commonly used heuristic in non-convex optimization is Normalized Gradient Descent (NGD), a variant of gradient descent in which only the direction of the gradient is taken into account and its magnitude is ignored. We analyze this heuristic and show that, with carefully chosen parameters and noise injection, this method can provably evade saddle points. We establish the convergence of NGD to a local minimum, and demonstrate rates which improve upon the fastest known first-order algorithm due to Ge et al. (2015). The effectiveness of our method is demonstrated via an application to the problem of online tensor decomposition, a task for which saddle point evasion is known to result in convergence to global minima.
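To make the idea in the abstract concrete, here is a minimal sketch of normalized gradient descent with periodically injected noise. It is an illustration under stated assumptions, not the paper's Algorithm: the toy objective `f`, the helper names, and the values of `eta`, `noise_scale`, and `noise_period` are all hypothetical choices made for this example, and they do not follow the parameter settings analyzed in the paper.

```python
import math
import random


def f(x):
    """Toy objective (assumed for illustration): strict saddle at (0, 0), minima at (0, +1) and (0, -1)."""
    return x[0] ** 2 + 0.25 * (x[1] ** 2 - 1.0) ** 2


def grad_f(x):
    """Gradient of the toy objective above."""
    return [2.0 * x[0], x[1] * (x[1] ** 2 - 1.0)]


def unit_noise(rng, dim):
    """Random direction: a Gaussian vector rescaled to unit length."""
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]


def noisy_ngd(grad, x0, eta=0.01, noise_scale=0.01, noise_period=10, steps=2000, seed=0):
    """Normalized gradient steps plus a small perturbation every `noise_period` steps.

    All default values are placeholders, not the schedule from the paper.
    """
    rng = random.Random(seed)
    x = list(x0)
    for t in range(steps):
        g = grad(x)
        g_norm = math.sqrt(sum(v * v for v in g))
        # Keep only the gradient direction; stay put if the gradient is exactly zero.
        direction = [v / g_norm for v in g] if g_norm > 0.0 else [0.0] * len(x)
        x = [xi - eta * di for xi, di in zip(x, direction)]
        # Periodic noise lets the iterate leave a saddle where the gradient vanishes.
        if t % noise_period == 0:
            n = unit_noise(rng, len(x))
            x = [xi + noise_scale * ni for xi, ni in zip(x, n)]
    return x


# Start exactly at the saddle, where plain (normalized) gradient descent cannot move.
x_final = noisy_ngd(grad_f, [0.0, 0.0])
print(x_final, f(x_final))
```

Starting exactly at the saddle, the gradient is zero, so only the injected noise can move the iterate. Once it is off the saddle, the fixed-length normalized steps carry it toward one of the two minima, where it then hovers within a distance on the order of `eta`.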

Paper Structure

This paper contains 31 sections, 14 theorems, 92 equations, 2 figures, 1 algorithm.

Key Result

Theorem 3

Let $\xi \in (0,1)$ and $\eta \in (0,\eta_{\max})$. Assume that $f:\mathbb{R}^d\mapsto\mathbb{R}$ is $(\alpha,\gamma,\nu,r)$-strict-saddle, $\beta$-smooth, has $\rho$-Lipschitz Hessians, and satisfies $|f(x)|\leq B$ for all $x\in\mathbb{R}^d$. Then, with probability at least $1-\xi$, Algorithm 1 (Saddle-NGD) reaches

Figures (2)

  • Figure 1: GD vs. NGD around a pure saddle. Left: gradients. Middle: normalized gradients. Right: GD compared against NGD, plotting $F(0,0)-F(x_1,x_2)$ against the number of iterations.
  • Figure 2: Noisy-GD vs. Saddle-NGD for the online ICA problem.
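
The comparison described in the Figure 1 caption can be reproduced roughly with the sketch below. The caption does not give the saddle function, so this example assumes the pure saddle $F(x_1,x_2)=x_1^2-x_2^2$. The starting point, step size, and iteration count are also illustrative choices, not values taken from the paper.

```python
import math


def saddle(x):
    """Assumed pure saddle F(x1, x2) = x1^2 - x2^2 (not specified in the caption)."""
    return x[0] ** 2 - x[1] ** 2


def saddle_grad(x):
    """Gradient of the assumed saddle function."""
    return [2.0 * x[0], -2.0 * x[1]]


def run(x0, eta, steps, normalize):
    """Run GD (normalize=False) or NGD (normalize=True) from x0."""
    x = list(x0)
    for _ in range(steps):
        g = saddle_grad(x)
        if normalize:
            # NGD: rescale the gradient to unit length so every step has length eta.
            # The norm stays positive here because x0 is not the saddle itself.
            norm = math.sqrt(g[0] ** 2 + g[1] ** 2)
            g = [v / norm for v in g]
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x


# Start very close to the saddle at the origin.
x0 = [1e-6, 1e-6]
gd_gap = saddle([0.0, 0.0]) - saddle(run(x0, 0.01, 100, normalize=False))
ngd_gap = saddle([0.0, 0.0]) - saddle(run(x0, 0.01, 100, normalize=True))
print("GD  gap F(0,0) - F(x):", gd_gap)
print("NGD gap F(0,0) - F(x):", ngd_gap)
```

Near the saddle the gradient is tiny, so plain GD takes tiny steps and its gap stays negligible over these 100 iterations. NGD moves a fixed distance per step regardless of gradient magnitude, so its gap grows much faster.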

Theorems & Definitions (30)

  • Definition 1: Strict-saddle
  • Definition 2
  • Theorem 3
  • Corollary 4
  • Lemma 5
  • Lemma 6
  • Lemma 7
  • Proof: Proof Sketch of Theorem (label: theorem:MAinPaper)
  • Proof: Proof of Lemma (label: lem:LargeGrads)
  • Lemma 8
  • ...and 20 more