Adaptive Multilevel Newton: A Quadratically Convergent Optimization Method

Nick Tsipinakis, Panagiotis Tigkas, Panos Parpas

TL;DR

An adaptive multilevel Newton-type method with a principled automatic switch to full Newton once its quadratic phase is reached. It consistently outperforms Newton's method, Gradient Descent, and the multilevel Newton method, indicating that second-order methods can outperform first-order methods even when Newton's method is initially slow.

Abstract

Newton's method may exhibit slower convergence than vanilla Gradient Descent in its initial phase on strongly convex problems. Classical Newton-type multilevel methods mitigate this but, like Gradient Descent, achieve only linear convergence near the minimizer. We introduce an adaptive multilevel Newton-type method with a principled automatic switch to full Newton once its quadratic phase is reached. Local quadratic convergence is established for strongly convex functions with Lipschitz continuous Hessians and for self-concordant functions, and is confirmed empirically. Although per-iteration cost can exceed that of classical multilevel schemes, the method is efficient and consistently outperforms Newton's method, Gradient Descent, and the multilevel Newton method, indicating that second-order methods can outperform first-order methods even when Newton's method is initially slow. The promising empirical results open new avenues for designing reduced-cost second- and higher-order methods with extremely fast convergence rates.
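The idea of taking cheap reduced-dimension Newton steps far from the minimizer and switching to full Newton steps close to it can be illustrated with a minimal sketch. This is not the authors' algorithm: the test problem (`make_problem`), the random coordinate-block coarse model, the gradient-norm threshold `switch_tol`, and the coarse-gradient test with `kappa` are all assumptions chosen for illustration, and the paper's actual coarse model and switching rule are not reproduced here.

```python
import numpy as np


def make_problem(rows=200, n=50, mu=0.5, seed=0):
    """Hypothetical strongly convex test problem (not from the paper):
    f(x) = sum(log cosh(Ax - b)) + (mu / 2) * ||x||^2."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((rows, n)) / np.sqrt(rows)
    b = 0.1 * rng.standard_normal(rows)

    def f(x):
        z = A @ x - b
        # log cosh(z) written in an overflow-safe form
        return np.sum(np.logaddexp(z, -z) - np.log(2.0)) + 0.5 * mu * (x @ x)

    def grad(x):
        return A.T @ np.tanh(A @ x - b) + mu * x

    def hess(x):
        w = 1.0 - np.tanh(A @ x - b) ** 2
        return (A.T * w) @ A + mu * np.eye(n)

    return f, grad, hess


def adaptive_multilevel_newton(f, grad, hess, x0, m, switch_tol=1e-2,
                               kappa=0.3, tol=1e-6, max_iter=5000, seed=1):
    """Coarse (m-dimensional) Newton steps with a switch to full Newton.

    The switching rule below is an assumed proxy for "the quadratic phase
    has been reached", not the rule derived in the paper.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    n = x.size
    history = []  # (step kind, gradient norm before the step)
    for _ in range(max_iter):
        g = grad(x)
        gnorm = np.linalg.norm(g)
        if gnorm < tol:
            break
        # For simplicity the full Hessian is formed every iteration; a
        # cost-aware implementation would form only the coarse block.
        H = hess(x)
        idx = rng.choice(n, size=m, replace=False)  # assumed coarse model
        g_c = g[idx]
        use_full = gnorm <= switch_tol or np.linalg.norm(g_c) < kappa * gnorm
        d = np.zeros(n)
        if use_full:
            d = -np.linalg.solve(H, g)
        else:
            d[idx] = -np.linalg.solve(H[np.ix_(idx, idx)], g_c)
        # Backtracking (Armijo) line search with a capped number of halvings
        t, fx, slope = 1.0, f(x), g @ d
        for _bt in range(50):
            if f(x + t * d) <= fx + 1e-4 * t * slope:
                break
            t *= 0.5
        x = x + t * d
        history.append(("full" if use_full else "coarse", gnorm))
    return x, history


if __name__ == "__main__":
    f, grad, hess = make_problem()
    x, history = adaptive_multilevel_newton(f, grad, hess, 3.0 * np.ones(50), m=25)
    for kind, gnorm in history:
        print(f"{kind:6s} step, gradient norm before step: {gnorm:.3e}")
```

Printing the history shows which kind of step was taken at each iteration, which makes it easy to see where the sketch hands over from coarse steps to full Newton steps on this toy problem.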

Paper Structure

This paper contains 21 sections, 19 theorems, 118 equations, 8 figures, 5 tables, 1 algorithm.

Key Result

Theorem 3.1

Let $f$ be a strictly convex self-concordant function and suppose that the sequence $(\mathbf{x}_{k})_{k \in \mathbb{N}}$ is generated by $\mathbf{x}_{k+1} = \mathbf{x}_{k} - t_{k} \mathbf{Q}_{h,k}^{-1} \nabla f (\mathbf{x}_{k})$, where $\mathbf{Q}_{h,k}^{-1}$ is as in (svd Q). Suppose also that ...

Figures (8)

  • Figure 1: Non-convex minimization. All the methods in plot (a) are initialized at the origin, while in plot (b) the initializer is drawn randomly from $\mathcal{N}(0,1)$. Plot (c) shows the convergence behavior of SigmaSVD for different values of $p$.
  • Figure 3: Non-convex minimization. All methods in plots (a) to (c) are initialized at the origin, while in plots (d) to (f) the initializer is drawn randomly from a Gaussian $\mathcal{N}(0,1)$.
  • Figure 4: Log Linear Regression. Plots (a) and (c) compare the optimization algorithms in the regime $m > n$, while plots (b) and (d) compare them in the regime $m < n$.
  • Figure 5: Logistic Regression. Plots (a) to (c) show the norm of the gradient vs CPU time in seconds, while plots (d) to (f) show the norm of the gradient vs iterations, for three machine learning datasets.
  • Figure 6: Support Vector Machines. Plots (a) and (b) show the norm of the gradient vs CPU time in seconds, while plots (c) and (d) show the norm of the gradient vs iterations, for two machine learning datasets.
  • ...and 3 more figures

Theorems & Definitions (37)

  • Definition 1
  • Theorem 3.1
  • Lemma 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Theorem 3.4
  • Remark 3.1
  • Remark 3.2
  • Lemma A.1: MR2142598
  • Lemma A.2: MR2142598
  • ...and 27 more