The Power of Normalization: Faster Evasion of Saddle Points
Kfir Y. Levy
TL;DR
The paper addresses the challenge of saddle points in non-convex optimization by introducing Saddle-NGD, a normalized gradient descent variant that periodically injects noise to help escape saddle points. It proves convergence to a local minimum in $\tilde{O}(\eta^{-3/2})$ iterations with a function-value gap of $\tilde{O}(\eta)$, where the maximal learning rate depends on the dimension as $\eta_{\max}=\tilde{O}(1/d^2)$; this improves on prior noisy-GD results. The stochastic extension has sample complexity comparable to that of noisy-GD, and experiments on online tensor decomposition and ICA demonstrate practical benefits on problems with saddle points and multi-modality. Altogether, the work provides a theoretically grounded, scalable first-order method for rapidly escaping saddles and converging to local minima in high-dimensional non-convex problems.
Abstract
A commonly used heuristic in non-convex optimization is Normalized Gradient Descent (NGD), a variant of gradient descent in which only the direction of the gradient is taken into account and its magnitude is ignored. We analyze this heuristic and show that with carefully chosen parameters and noise injection, this method can provably evade saddle points. We establish the convergence of NGD to a local minimum, and demonstrate rates which improve upon the fastest known first-order algorithm due to Ge et al. (2015). The effectiveness of our method is demonstrated via an application to the problem of online tensor decomposition, a task for which saddle point evasion is known to result in convergence to global minima.
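To make the idea concrete, the following is a minimal sketch of normalized gradient descent with periodic noise injection. It is an illustration under stated assumptions and not the paper's Saddle-NGD algorithm with its analyzed parameter choices. The function names, the toy objective $f(x,y) = (x^2-1)^2/4 + y^2/2$ (strict saddle at the origin, local minima at $(\pm 1, 0)$), the noise distribution (uniform on a sphere), and the values of `eta`, `noise_scale`, and `noise_period` are all hypothetical.

```python
import math
import random


def normalized_gradient_step(x, grad, eta):
    """Move a fixed distance eta along -grad, ignoring the gradient magnitude."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        # At an exact stationary point the direction is undefined; stay put.
        return list(x)
    return [xi - eta * gi / norm for xi, gi in zip(x, grad)]


def random_unit_vector(dim, rng):
    """Sample a direction uniformly from the unit sphere (assumed noise model)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def noisy_ngd(grad_fn, x0, eta=0.01, noise_scale=0.01,
              noise_period=10, steps=2000, seed=0):
    """NGD with a random perturbation every `noise_period` iterations.

    All default values are illustrative assumptions, not the paper's settings.
    """
    rng = random.Random(seed)
    x = list(x0)
    for t in range(steps):
        x = normalized_gradient_step(x, grad_fn(x), eta)
        if t % noise_period == 0:
            u = random_unit_vector(len(x), rng)
            x = [xi + noise_scale * ui for xi, ui in zip(x, u)]
    return x


def toy_gradient(p):
    """Gradient of the toy objective f(x, y) = (x^2 - 1)^2 / 4 + y^2 / 2."""
    x, y = p
    return [x ** 3 - x, y]


# Started exactly at the saddle, the noiseless iteration never moves,
# while the noisy variant drifts away and settles near one of the minima.
stuck = noisy_ngd(toy_gradient, [0.0, 0.0], noise_scale=0.0)
escaped = noisy_ngd(toy_gradient, [0.0, 0.0])
```

Because each step has fixed length `eta`, the iterate does not slow down where the gradient is small, which is the intuition behind faster saddle evasion. The same property means the iterate keeps oscillating at scale `eta` around a minimum instead of settling exactly on it.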
