Adaptive Multilevel Newton: A Quadratically Convergent Optimization Method

Nick Tsipinakis, Panos Parpas, Matthias Voigt

TL;DR

The paper addresses the slow early phase of pure Newton methods by proposing Adaptive Multilevel (AML) Newton, which switches adaptively between coarse-level subspace steps and full Newton steps so that the iterates reach a local quadratic convergence regime. It proves local quadratic rates for strongly convex functions with Lipschitz continuous Hessians and for self-concordant functions, plus probabilistic quadratic convergence under Johnson–Lindenstrauss-type subspace embeddings. Extensive experiments show that cheap early iterations paired with principled level-acceptance rules can outperform Newton's method, Gradient Descent, and classical multilevel Newton methods on structured, low-rank problems. The approach targets large-scale, ill-conditioned optimization where second-order information is crucial but expensive to compute from scratch. Overall, AML--Newton combines automatic level selection with adaptive stepping to achieve fast, robust convergence comparable to Newton's rate while reducing total runtime in the critical initial phase.

Abstract

Newton's method may exhibit slower convergence than vanilla Gradient Descent in its initial phase on strongly convex problems. Classical Newton-type multilevel methods mitigate this but, like Gradient Descent, achieve only linear convergence near the minimizer. We introduce an adaptive multilevel Newton-type method with a principled automatic switch to full Newton once its quadratic phase is reached. The local quadratic convergence for strongly convex functions with Lipschitz continuous Hessians and for self-concordant functions is established and confirmed empirically. Although per-iteration cost can exceed that of classical multilevel schemes, the method is efficient and consistently outperforms Newton's method, Gradient Descent, and the multilevel Newton method, indicating that second-order methods can outperform first-order methods even when Newton's method is initially slow.
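To make the coarse/fine switching idea concrete, here is a minimal two-level sketch in Python. It is not the paper's algorithm: it assumes a Galerkin coarse model with prolongation $P = R^\top$, and it uses the test $\|R \nabla f(x)\| \geq \kappa \|\nabla f(x)\|$ as an illustrative placeholder for the paper's level-acceptance rule, which is not reproduced here. The names `coarse_direction`, `two_level_newton`, and `kappa` are hypothetical.

```python
import numpy as np


def coarse_direction(grad, hess, R):
    """Galerkin coarse direction d = -R^T (R H R^T)^{-1} R g (assumes P = R^T)."""
    g_c = R @ grad            # restricted gradient
    H_c = R @ hess @ R.T      # reduced (coarse) Hessian
    return -R.T @ np.linalg.solve(H_c, g_c)


def two_level_newton(f, f_grad, f_hess, x0, R, kappa=0.1, tol=1e-10, max_iter=50):
    """Take a coarse step when the restricted gradient is large enough,
    otherwise a full Newton step. The acceptance test is an assumption,
    not the condition used in the paper."""
    x = np.asarray(x0, dtype=float)
    kinds = []                # records which kind of step was taken
    for _ in range(max_iter):
        g = f_grad(x)
        g_norm = np.linalg.norm(g)
        if g_norm <= tol:
            break
        H = f_hess(x)
        if np.linalg.norm(R @ g) >= kappa * g_norm:
            d, kind = coarse_direction(g, H, R), "coarse"
        else:
            d, kind = -np.linalg.solve(H, g), "fine"
        # Armijo backtracking, starting from the unit step
        t, slope, fx = 1.0, float(g @ d), f(x)
        while t > 1e-12 and f(x + t * d) > fx + 1e-4 * t * slope:
            t *= 0.5
        x = x + t * d
        kinds.append(kind)
    return x, kinds


# Toy strictly convex quadratic f(x) = 0.5 x^T A x (hypothetical test problem)
A = np.diag([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
R = np.eye(6)[:3]             # restriction onto the first three coordinates
x_star, kinds = two_level_newton(
    lambda x: 0.5 * x @ A @ x, lambda x: A @ x, lambda x: A, np.ones(6), R
)
```

On a strictly convex quadratic, a unit coarse step of this form zeroes the restricted gradient $R \nabla f$, so the placeholder test rejects the coarse level at the next iterate and the sketch falls back to a full Newton step.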

Paper Structure

This paper contains 15 sections, 9 theorems, 76 equations, 6 figures, 6 tables, 1 algorithm.

Key Result

Lemma 3.1

Let $x_{k+1} = x_k + d_{m, k}$, where $d_{m, k}$ is as in \ref{eq: coarse direction at m}. If inequality \ref{ineq:conditions-coarse-model-new} holds, then $\|\nabla f(x_{k})\| = 0 \iff \| R_i \nabla f(x_{k})\| = 0$ for all $k \in \mathbb{N}$ and $1 \leq i \leq m-1$.

Figures (6)

  • Figure 1: Minimization of the Poisson loss using a low-rank Generated dataset. Gradient Descent converges slowly in the late stages of the minimization process and Newton's method in the early stages; Gradient Descent significantly outperforms Newton's method early on. According to the figure, one must select $\varepsilon \geq 10^{-1}$, otherwise the hybrid process may converge to a sub-optimal solution due to Gradient Descent's inefficiency near the minimizer. More details on this experiment can be found in \ref{sec: experiments}.
  • Figure 1: The plots illustrate the local quadratic rates of \ref{alg: multilevel} for different choices of $\sigma$ in comparison to those of the Newton method. The number of fine steps taken by \ref{alg: multilevel} for each experiment can be found in \ref{tab:sigma_all_datasets}.
  • Figure 2: Comparisons between optimization algorithms on Poisson regression with Generated data. The left column shows the convergence rate, the right column shows the runtime in seconds. The plot shows the convergence behavior of optimization algorithms for low-rank structured problems. The $\ell_1$-regularization or a bad initial point poses difficulties for optimization methods. Newton's first phase can be extremely slow on such structures, while Gradient Descent and the ML--Newton method converge to sub-optimal solutions.
  • Figure 3: Comparisons between optimization algorithms on Logistic regression. Each row corresponds to a dataset; the left column plots the gradient norm vs. iterations, the right column plots the gradient norm vs. runtime in seconds. See \ref{tab:logistic regression comparison} for the number of fine and coarse steps computed by the multilevel methods.
  • Figure 4: Comparisons between optimization algorithms. Each row corresponds to a dataset; the left column plots the gradient norm vs. iterations, the right column plots the gradient norm vs. runtime in seconds. See \ref{tab:logistic+poisson regression comparison} for the number of fine and coarse steps computed by the multilevel methods.
  • ...and 1 more figures

Theorems & Definitions (19)

  • Lemma 3.1
  • Proof 1
  • Lemma 4.1
  • Proof 2
  • Theorem 4.2
  • Proof 3
  • Remark 4.3
  • Remark 4.4
  • Lemma 4.5: Lemma 3.1 in tsipinakis2025convergence
  • Lemma 4.6
  • ...and 9 more