Complexity Guarantees for Nonconvex Newton-MR Under Inexact Hessian Information

Alexander Lim, Fred Roosta

TL;DR

Extends the Newton-MR algorithm for nonconvex unconstrained optimization to settings where Hessian information is approximated, and shows that, under certain conditions, the algorithm achieves a global linear convergence rate.

Abstract

We consider an extension of the Newton-MR algorithm for nonconvex unconstrained optimization to settings where Hessian information is approximated. Under a particular noise model on the Hessian matrix, we investigate the iteration and operation complexities of this variant to achieve appropriate sub-optimality criteria in several nonconvex settings. We first consider functions that satisfy the (generalized) Polyak-Łojasiewicz condition, a special sub-class of nonconvex functions, and show that, under certain conditions, our algorithm achieves a global linear convergence rate. We then consider more general nonconvex settings, where the rate to obtain first-order sub-optimality is shown to be sub-linear. In all these settings, we show that our algorithm converges regardless of the degree of approximation of the Hessian and of the accuracy of the solution to the sub-problem. Finally, we compare the performance of our algorithm with several alternatives on a few machine learning problems.
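To make the idea of a Newton-type step with approximated Hessian information concrete, here is a minimal sketch on a regularized logistic loss with a sub-sampled Hessian. It is an illustration under stated assumptions, not the paper's algorithm: the sub-problem is solved exactly with `numpy.linalg.lstsq` rather than iteratively, the steepest-descent fallback and the Armijo constants are choices made here, and no limited-curvature detection is implemented. All function names (`inexact_newton_mr`, `subsampled_hessian`, and so on) are hypothetical.

```python
import numpy as np

def logistic_loss(w, X, y, lam):
    # Ridge-regularized logistic loss with labels y in {0, 1}.
    z = X @ w
    return np.mean(np.logaddexp(0.0, z) - y * z) + 0.5 * lam * (w @ w)

def gradient(w, X, y, lam):
    # Exact gradient over the full data set.
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / X.shape[0] + lam * w

def subsampled_hessian(w, X, lam, sample_size, rng):
    # Inexact Hessian: curvature estimated from a random subset of rows.
    idx = rng.choice(X.shape[0], size=sample_size, replace=False)
    Xs = X[idx]
    p = 1.0 / (1.0 + np.exp(-(Xs @ w)))
    d = p * (1.0 - p)
    return (Xs.T * d) @ Xs / sample_size + lam * np.eye(X.shape[1])

def inexact_newton_mr(w, X, y, lam=1e-2, sample_size=50, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        g = gradient(w, X, y, lam)
        if np.linalg.norm(g) < 1e-8:
            break
        H = subsampled_hessian(w, X, lam, sample_size, rng)
        # Minimum-residual direction: least-squares solution of H p = -g.
        p = np.linalg.lstsq(H, -g, rcond=None)[0]
        if g @ p >= 0:
            # Safeguard (an assumption of this sketch): use steepest descent.
            p = -g
        # Armijo backtracking on the objective value.
        step, f0 = 1.0, logistic_loss(w, X, y, lam)
        while step > 1e-12 and (
            logistic_loss(w + step * p, X, y, lam) > f0 + 1e-4 * step * (g @ p)
        ):
            step *= 0.5
        w = w + step * p
    return w
```

Shrinking `sample_size` makes each Hessian cheaper but noisier. The line search keeps the objective decreasing either way, which is the kind of trade-off the experiments in Figures 1 and 2 examine.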

Paper Structure

This paper contains 20 sections, 8 theorems, 53 equations, 21 figures, 4 algorithms.

Key Result

Lemma 1

Suppose the $\sigma$-limited curvature condition (Definition 4) has not yet been detected at iteration $t$. Then, for any vector $\mathbf{v} \in \mathcal{K}_{t}(\mathbf{\bar{H}}_k, \mathbf{g}_k)$, we have $\left\langle \mathbf{v}, \mathbf{\bar{H}}_k \mathbf{v} \right\rangle \geq \sigma \|\mathbf{v}\|^2$.
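The conclusion of Lemma 1 can be checked numerically for a given matrix and vector. The smallest value of $\langle \mathbf{v}, \mathbf{\bar{H}}\mathbf{v} \rangle / \|\mathbf{v}\|^2$ over a subspace equals the smallest eigenvalue of $\mathbf{Q}^\top \mathbf{\bar{H}} \mathbf{Q}$, where the columns of $\mathbf{Q}$ form an orthonormal basis of that subspace. The sketch below builds such a basis for the Krylov subspace $\mathcal{K}_t(\mathbf{\bar{H}}, \mathbf{g})$ and returns that eigenvalue. It does not implement the limited-curvature detection itself; the helper names `krylov_basis` and `min_curvature` are hypothetical, and `g` is assumed to be nonzero.

```python
import numpy as np

def krylov_basis(H, g, t):
    # Orthonormal basis of span{g, Hg, ..., H^(t-1) g} via Gram-Schmidt.
    basis = []
    v = np.asarray(g, dtype=float)
    for _ in range(t):
        for q in basis:
            v = v - (q @ v) * q
        norm = np.linalg.norm(v)
        if norm < 1e-12:
            # The subspace has become invariant; no new direction to add.
            break
        q = v / norm
        basis.append(q)
        v = H @ q
    return np.column_stack(basis)

def min_curvature(H, g, t):
    # Smallest Rayleigh quotient <v, Hv> / ||v||^2 over v in K_t(H, g).
    Q = krylov_basis(H, g, t)
    T = Q.T @ H @ Q
    return np.linalg.eigvalsh(0.5 * (T + T.T))[0]

# Indefinite example: the bound can hold on a small Krylov subspace
# and fail once the subspace grows to capture the negative eigenvalue.
H = np.diag([-1.0, 1.0, 2.0])
g = np.ones(3)
early = min_curvature(H, g, 1)   # curvature along g alone
full = min_curvature(H, g, 3)    # curvature over the whole space
```

For a threshold $\sigma$, comparing `min_curvature(H, g, t)` against $\sigma$ shows whether the inequality in the lemma holds on $\mathcal{K}_t$ for that particular instance.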

Figures (21)

  • Figure 1: Performance of Newton-MR using various degrees of Hessian approximation on the nonconvex nonlinear least squares loss function. As predicted by our theory, Newton-MR converges irrespective of the degree of Hessian approximation. Also, Hessian approximation typically reduces computational costs; however, a substantial reduction in sub-sample size can lead to a significant loss of curvature information, resulting in poor performance.
  • Figure 2: Performance of Newton-MR using various degrees of Hessian approximation on the convex logistic loss function. As predicted by our theory, Newton-MR converges irrespective of the degree of Hessian approximation. Also, Hessian approximation typically reduces computational costs; however, a substantial reduction in sub-sample size can lead to a significant loss of curvature information, resulting in poor performance.
  • Figure 3: Comparison of Newton-MR and Newton-CG on the CIFAR10 dataset (section label: sec:exp:ffnn).
  • Figure 4: Comparison of Newton-MR and Trust-Region on the CIFAR10 dataset (section label: sec:exp:ffnn).
  • Figure 5: Comparison of Newton-MR and L-BFGS on the CIFAR10 dataset (section label: sec:exp:ffnn).
  • ...and 16 more figures

Theorems & Definitions (25)

  • Definition 1: $\theta$-Polyak-Łojasiewicz Condition
  • Definition 2: $\varepsilon_f$-Global Optimality
  • Definition 3: $\varepsilon_\mathbf{g}$-First Order Optimality
  • Definition 4: $\sigma$-Limited Curvature Direction
  • Definition 5: Inexact Hessian
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • ...and 15 more