A practical randomized trust-region method to escape saddle points in high dimension

Radu-Alexandru Dragomir, Xiaowen Jiang, Bonan Sun, Nicolas Boumal

Abstract

Without randomization, escaping the saddle points of $f \colon \mathbb{R}^d \to \mathbb{R}$ requires at least $\Omega(d)$ pieces of information about $f$ (values, gradients, Hessian-vector products). With randomization, this can be reduced to a polylogarithmic dependence in $d$. The prototypical algorithm to that effect is perturbed gradient descent (PGD): through sustained jitter, it reliably escapes strict saddle points. However, it also never settles: there is no convergence. What is more, PGD requires precise tuning based on Lipschitz constants and a preset target accuracy. To improve on this, we modify the time-tested trust-region method with truncated conjugate gradients (TR-tCG). Specifically, we randomize the initialization of tCG (the subproblem solver), and we prove that tCG automatically amplifies the randomization near saddles (to escape) and absorbs it near local minimizers (to converge). Saddle escape happens over several iterations. Accordingly, our analysis is multi-step, with several novelties. The proposed algorithm is practical: it essentially tracks the good behavior of TR-tCG, with three minute modifications and a single new hyperparameter (the noise scale $\sigma$). We provide an implementation and numerical experiments.
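
As a toy illustration of why randomization matters near a strict saddle (this is not the paper's algorithm, nor PGD with its sustained jitter), the sketch below runs plain gradient descent on the quadratic $f(x) = \tfrac{1}{2} x^\top H x$ with $H = \mathrm{diag}(1, -1)$. The function name `gradient_descent`, the step size, the iteration count and the single initial perturbation of scale `sigma` are all assumptions chosen for the example.

```python
import numpy as np

# Toy objective f(x) = 0.5 * x @ H @ x with a strict saddle at the origin.
H = np.diag([1.0, -1.0])

def gradient_descent(x0, step=0.1, iters=200):
    """Plain gradient descent on the toy quadratic (hypothetical helper)."""
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        x = x - step * (H @ x)  # the gradient of f at x is H @ x
    return x

# Started exactly at the saddle, the gradient is zero and nothing moves.
stuck = gradient_descent([0.0, 0.0])

# A tiny random perturbation is amplified along the negative-curvature
# direction (second coordinate) and damped along the positive one.
rng = np.random.default_rng(0)
sigma = 1e-6  # perturbation scale, an assumption for this example
escaped = gradient_descent(sigma * rng.standard_normal(2))
```

Because this toy quadratic is unbounded below, "escaping" here only means that the iterate leaves a neighborhood of the saddle; it says nothing about convergence.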

Paper Structure (63 sections, 35 theorems, 196 equations, 6 figures, 3 algorithms)

Key Result

Theorem 3.4

Assume that the noise scale satisfies $0 < \sigma \leq \bar{\sigma}_{\rm global}$. Then, the radius is lower bounded as $\Delta_k \geq 8 \bar{R}$ for every $k$ (Lemma \ref{lemma:lowbound_deltak}) and, for any target accuracy $\epsilon \leq \bar{R}$, with probability at least $1-\delta$, the method finds a [...] outer iterations. These require at most $K+2$ evaluations of $f$ and $\nabla f$, and at most Hessi[...]

Figures (6)

  • Figure 1: Demonstrating the two scenarios in $\mathtt{tCG\text{-}bg}$ (Algorithm \ref{algo:tCG}). If the $\mathtt{residual}$ criterion triggers while the iterates are still inside the ball of radius $\Delta/2$, we stop there (and the method is equivalent to standard CG). If the iterates reach the sphere of radius $\Delta/2$, we perform a last gradient step from $v^{(T)}$. This last step unlocks a descent guarantee in the non-convex case.
  • Figure 2: Illustration of the CG dynamics on a non-convex model $m$ (Proposition \ref{prop:vt_qi_growth}). The vectors $q_+$ and $q_-$ are directions of positive and negative curvature, respectively, and $\bar{v}$ is the saddle point. Since the projection of $v^{(t)}-\bar{v}$ on $q_-$ increases along the iterations, the iterates never enter the blue area. They eventually reach the boundary of the trust region (dotted line).
  • Figure 3: Experimental run for Problem 1 (sine saddle, see \ref{eq:sinesaddle}). All methods escape the saddle point and converge to a global minimizer. The dashed lines (TR-tCG with and without Hessian shift) overlap almost perfectly, as do the curves for TR-tCG-bg with $\sigma = 10^{-6}$.
  • Figure 4: Experimental run for Problem 2 (nonlinear synchronization, see \ref{eq:nonlinearsynchro}).
  • Figure 5: Experimental run for Problem 3 (rectangular matrix approximation, see \ref{eq:matrixapprox_rect}) with three different choices of rank $r$ and regularization parameter $\lambda$.
  • ...and 1 more figure
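
The two scenarios in the Figure 1 caption (stop on the residual criterion inside the ball, or reach the sphere) can be sketched with a classical Steihaug-Toint style truncated CG started from a small random point. This is a minimal sketch under stated assumptions, not the paper's $\mathtt{tCG\text{-}bg}$: the function `truncated_cg`, its tolerance, the rule of moving to the sphere along the current direction on non-positive curvature, and the test matrices are all hypothetical. The final gradient step mentioned in the caption is omitted because its step-size rule is not given here.

```python
import numpy as np

def truncated_cg(g, H, radius, v0, tol=1e-10, max_iter=100):
    """Truncated CG on the model m(v) = g @ v + 0.5 * v @ H @ v, started at v0.

    Returns (v, status) with status "residual", "boundary" or "max_iter".
    """
    v = v0.copy()
    r = -(g + H @ v)  # negative model gradient at v
    p = r.copy()
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:
            return v, "residual"
        Hp = H @ p
        curv = p @ Hp
        alpha = (r @ r) / curv if curv > 0 else np.inf
        if curv <= 0 or np.linalg.norm(v + alpha * p) >= radius:
            # Move along p to the sphere: positive root of ||v + tau p|| = radius.
            a, b, c = p @ p, 2.0 * (v @ p), v @ v - radius**2
            tau = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
            return v + tau * p, "boundary"
        v = v + alpha * p
        r_new = r - alpha * Hp
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return v, "max_iter"

rng = np.random.default_rng(0)
u = rng.standard_normal(2)
v0 = 1e-6 * u / np.linalg.norm(u)  # random start of scale sigma = 1e-6
radius = 0.25                      # plays the role of Delta / 2

# Saddle model: zero gradient, indefinite Hessian. The noise is amplified.
v_saddle, status_saddle = truncated_cg(np.zeros(2), np.diag([1.0, -1.0]), radius, v0)

# Convex model near a minimizer. The noise is absorbed by CG.
g_min, H_min = np.array([0.1, -0.2]), np.diag([1.0, 2.0])
v_min, status_min = truncated_cg(g_min, H_min, radius, v0)
```

On the saddle model the run ends on the sphere of the given radius, while on the convex model it ends on the residual criterion at the Newton step $-H^{-1} g$, regardless of the small random start.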

Theorems & Definitions (76)

  • Definition 3.1
  • Definition 3.2
  • Example 3.3: Rank-one matrix factorization
  • Theorem 3.4: Simplification of Theorems \ref{thm:outer_complexity} and \ref{thm:global_inner_its}
  • Theorem 3.5: Saddle escape, simplification of Theorem \ref{thm:saddle_escape}
  • Lemma 4.1
  • proof
  • Lemma 4.2: Classical properties of CG
  • proof
  • Lemma 4.3
  • ...and 66 more