Regularized Overestimated Newton

Danny Duan, Hanbaek Lyu

TL;DR

This work introduces Regularized Overestimated Newton (RON), a Newton-type method for smooth convex optimization that uses a PSD overestimation of the Hessian plus a gradient-norm-dependent regularizer $\lambda_n=\sqrt{L_H\|\nabla f(\boldsymbol{\theta}_n)\|}$, where $L_H$ is the Lipschitz constant of the Hessian. A practical rank-$k$ Hessian estimator based on Randomly Pivoted Cholesky (RPC) yields a per-iteration cost of $O(dk^2)$ and recovers the Hessian exactly when $k$ exceeds the local Hessian rank, in which case RON reduces to the globally regularized Newton (GRN) variant. The authors prove two-phase global convergence, $\mathbb{E}[f(\boldsymbol{\theta}_n)]-f^* = O(n^{-2})$ initially and $O(\overline{\varepsilon}_{\mathbb{E}} n^{-1})$ subsequently, as well as local convergence under a local Quadratic Growth condition: superlinear when the overestimation is exact, and linear with effective condition number $\frac{1.2\mu+\overline{\varepsilon}_{\mathbb{E}}}{\mu}$ when it is not. The framework is validated on entropic optimal transport and large linear inverse problems, where it outperforms the compared methods when the Hessian has low-rank structure and remains competitive with first-order approaches in degenerate landscapes. Overall, RON offers a scalable, theoretically grounded Newton-type method with tunable global and local behavior and practical low-rank implementations via RPC.
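The regularized step described above can be illustrated with a minimal sketch. It assumes the update takes the form $\boldsymbol{\theta}_{n+1}=\boldsymbol{\theta}_n-(\mathbf{B}_n+\lambda_n\mathbf{I})^{-1}\nabla f(\boldsymbol{\theta}_n)$ with $\mathbf{B}_n\succeq\nabla^2 f(\boldsymbol{\theta}_n)$, and it uses the exact Hessian as the overestimate (the GRN case). The function name `regularized_newton_step`, the separable log-cosh test objective, and the choice `L_H = 1.0` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def regularized_newton_step(theta, grad, overest, L_H):
    """One step theta - (B + lam*I)^{-1} g with lam = sqrt(L_H * ||g||).

    `overest(theta)` is assumed to return a PSD matrix B >= Hessian.
    """
    g = grad(theta)
    g_norm = np.linalg.norm(g)
    if g_norm == 0.0:  # already stationary; avoid a possibly singular solve
        return theta
    lam = np.sqrt(L_H * g_norm)
    B = overest(theta)
    return theta - np.linalg.solve(B + lam * np.eye(theta.size), g)

# Hypothetical test problem: f(theta) = sum_i log cosh(theta_i - c_i),
# a smooth convex function minimized at theta = c.
c = np.array([1.0, -2.0, 0.5])
grad = lambda th: np.tanh(th - c)
hess = lambda th: np.diag(1.0 - np.tanh(th - c) ** 2)  # exact Hessian

theta_final = np.zeros(3)
for _ in range(50):
    theta_final = regularized_newton_step(theta_final, grad, hess, L_H=1.0)
```

Because the regularizer shrinks with the gradient norm, the step approaches a pure Newton step near the minimizer while staying damped far from it.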

Abstract

We propose Regularized Overestimated Newton (RON), a Newton-type method with low per-iteration cost and strong global and local convergence guarantees for smooth convex optimization. RON interpolates between gradient descent and globally regularized Newton, with behavior determined by the largest Hessian overestimation error. Globally, when the optimality gap of the objective is large, RON achieves an accelerated $O(n^{-2})$ convergence rate; when small, its rate becomes $O(n^{-1})$. Locally, RON converges superlinearly and linearly when the overestimation is exact and inexact, respectively, toward possibly non-isolated minima under the local Quadratic Growth (QG) condition. The linear rate is governed by an improved effective condition number depending on the overestimation error. Leveraging a recent randomized rank-$k$ Hessian approximation algorithm, we obtain a practical variant with $O(\text{dim}\cdot k^2)$ cost per iteration. When the Hessian rank is uniformly below $k$, RON achieves a per-iteration cost comparable to that of first-order methods while retaining the superior convergence rates even in degenerate local landscapes. We validate our theoretical findings through experiments on entropic optimal transport and inverse problems.
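To make the $O(\text{dim}\cdot k^2)$ cost concrete, here is a rough sketch of a randomly pivoted Cholesky factorization followed by a Woodbury solve. It is a generic version of the technique, not the authors' implementation: the names `rp_cholesky` and `woodbury_solve`, the stopping tolerance `tol`, and the shift `sigma` (standing in for whatever regularization is added to the low-rank factor) are all assumptions for illustration.

```python
import numpy as np

def rp_cholesky(column, diag, k, rng, tol=1e-10):
    """Return F (d x at most k) with F @ F.T approximating a PSD matrix.

    `column(j)` returns column j of the matrix; `diag` is its diagonal.
    Only k columns are ever read, and the arithmetic costs O(d k^2).
    """
    d = diag.shape[0]
    F = np.zeros((d, k))
    resid = np.array(diag, dtype=float)  # diagonal of the current residual
    trace0 = resid.sum()
    for i in range(k):
        total = resid.sum()
        if total <= tol * trace0:  # residual is numerically zero
            return F[:, :i]
        s = rng.choice(d, p=resid / total)  # pivot sampled prop. to residual diagonal
        col = column(s) - F[:, :i] @ F[s, :i]  # residual column at the pivot
        F[:, i] = col / np.sqrt(col[s])
        resid = np.clip(resid - F[:, i] ** 2, 0.0, None)
    return F

def woodbury_solve(F, sigma, g):
    """Solve (F F^T + sigma I) x = g in O(d k^2) via the Woodbury identity."""
    k = F.shape[1]
    small = sigma * np.eye(k) + F.T @ F  # k x k system
    return (g - F @ np.linalg.solve(small, F.T @ g)) / sigma

# Hypothetical usage on a rank-5 PSD matrix of size 40 with k = 8 > rank.
rng = np.random.default_rng(0)
G = rng.standard_normal((40, 5))
H = G @ G.T
F = rp_cholesky(lambda j: H[:, j], np.diag(H), k=8, rng=rng)
g = rng.standard_normal(40)
sigma = 0.1
step = woodbury_solve(F, sigma, g)
```

In this example $k$ exceeds the rank of the matrix, so the factorization stops early and reproduces it up to rounding, which mirrors the regime in which the abstract says the superior convergence rates are retained.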

Paper Structure

This paper contains 13 sections, 13 theorems, 91 equations, 3 figures, 1 table, 2 algorithms.

Key Result

Lemma 3.1

Let $\hat{\mathbf{B}}_n=\hat{\mathbf{H}}_n+\rho_n\mathbf{I}$ be as in eq:RPC_hessian_approximation. Then for any $\tau_n\ge \varepsilon_n$, we have $\sigma_{\min}(\hat{\mathbf{B}}_n)+\tau_n\ge \frac{\varepsilon_n+\tau_n}{2}$. If $\hat{\mathbf{H}}_n$ is in fact a PSD underestimation of the Hessian, namely ...

Figures (3)

  • Figure 1: Example of flat local minimum $\boldsymbol{\theta}_{*}$. Contour represents the level curve of the objective.
  • Figure 2: Gradient norm vs. iteration (top row) and vs. time (bottom row) for solving the EOT problem with various algorithms. In column a, the source and target distributions are synthesized Gaussian distributions with means $\mu=0.3$ and $\mu=0.7$, respectively, and standard deviation $\sigma=0.001$, with dimension $5000$; the cost matrix has i.i.d. uniform entries in $[0,1]$. In column b, the source and target distributions and the cost matrix are generated in the same way, except that the dimension is $10000$. In column c, the source and target distributions are two normalized images from the MNIST data set [lecun1998gradient], and the cost matrix is the pixelwise $\ell_1$ distance.
  • Figure 3: Singular value distributions of four matrices, and the results of solving $\mathbf{A}\mathbf{x}=\mathbf{b}$, plotted as least-squares error vs. flops used by the algorithms. Part a (left) shows the experiments on synthesized data. Part b (right) shows the experiment on the matrix Maragal2 from the SuiteSparse Matrix Collection [kolodziej2019suitesparse].

Theorems & Definitions (25)

  • Lemma 3.1
  • proof
  • Theorem 3.2: Global convergence for convex objective
  • Theorem 3.3: Local contraction of gradient norm
  • Corollary 3.4: Superlinear local convergence with exact Hessian
  • Corollary 3.5: Linear local convergence with overestimated Hessian
  • Lemma 4.1: Stability
  • proof
  • Lemma 4.2: Descent
  • proof
  • ...and 15 more