Skip the Hessian, Keep the Rates: Globalized Semismooth Newton with Lazy Hessian Updates

Amal Alphonse; Pavel Dvurechensky; Clemens Sirotenko

TL;DR

This work introduces GLAd-SSN, a globalized semismooth Newton method with lazy Hessian updates that achieves global convergence rates and local superlinear convergence under semismoothness, extending to γ-order semismoothness for accelerated rates. The key ideas combine proximal Newton steps with periodic Hessian reuse, a gradient-regularized adaptive regularization parameter, and linesearch, enabling efficient handling of nonsmooth objectives in potentially infinite-dimensional Hilbert spaces. Theoretical results cover nonconvex, PL, and convex regimes, detailing global rates, linear and superlinear behavior, and refined rates in finite-dimensional C^2 settings. Experiments on neural networks with Lipschitz constraints, nonnegative matrix factorization, and SVMs demonstrate substantial speedups from Hessian laziness without sacrificing convergence guarantees, highlighting practical impact for large-scale ML optimization with nonsmooth elements.
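For orientation, the regularity notions named above are usually stated as follows. This is a sketch of the standard textbook definitions, not a quotation from the paper: the symbols $G$, $\partial G$, $\mu$, and $f^\ast$ are assumptions introduced here, and the paper's own assumptions may be phrased differently (for instance, for composite objectives with a nonsmooth term).

```latex
% Semismoothness (Newton differentiability) of a map G : X -> Y at x,
% with respect to a set-valued generalized derivative \partial G:
\sup_{M \in \partial G(x+h)} \bigl\| G(x+h) - G(x) - M h \bigr\| = o(\|h\|)
\quad \text{as } h \to 0 .

% \gamma-order semismoothness, for some \gamma \in (0,1]:
\sup_{M \in \partial G(x+h)} \bigl\| G(x+h) - G(x) - M h \bigr\| = O\bigl(\|h\|^{1+\gamma}\bigr)
\quad \text{as } h \to 0 .

% Polyak-Lojasiewicz (PL) condition for a differentiable f with infimum f^\ast
% and constant \mu > 0:
\|\nabla f(x)\|^{2} \;\ge\; 2\mu \bigl( f(x) - f^\ast \bigr) \quad \text{for all } x .
```

In a Newton-type method, $G$ plays the role of the gradient of the smooth part of the objective, so semismoothness replaces the classical requirement that this gradient be differentiable.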

Abstract

Second-order methods are provably faster than first-order methods, and their efficient implementations for large-scale optimization problems have attracted significant attention. Yet, optimization problems in ML often have nonsmooth derivatives, which makes the existing convergence rate theory of second-order methods inapplicable. In this paper, we propose a new semismooth Newton method (SSN) that enjoys both global convergence rates and asymptotic superlinear convergence without requiring second-order differentiability. Crucially, our method does not require (generalized) Hessians to be evaluated at each iteration but only periodically, and it reuses stale Hessians otherwise (i.e., it performs lazy Hessian updates), saving compute cost and often leading to significant speedups in time, whilst still maintaining strong global and local convergence rate guarantees. We develop our theory in an infinite-dimensional setting and illustrate it with numerical experiments on matrix factorization and neural networks with Lipschitz constraints.
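The idea of lazy Hessian updates can be illustrated on a small toy problem. The sketch below is not the authors' GLAd-SSN algorithm: the test objective, the acceptance test, the rule `lam = sqrt(M * ||g||)`, the doubling/halving of `M`, and all function and parameter names are assumptions chosen for illustration, and there is no proximal term. It minimizes a function whose gradient is piecewise linear (semismooth but not differentiable), recomputes a generalized Hessian only every `m` iterations, and reuses the stale one in between.

```python
import numpy as np

def objective(x, A, b, mu):
    """f(x) = 0.5*||Ax - b||^2 + 0.5*mu*||max(x, 0)||^2 (C^1, but not C^2)."""
    r = A @ x - b
    return 0.5 * (r @ r) + 0.5 * mu * np.sum(np.maximum(x, 0.0) ** 2)

def gradient(x, A, b, mu):
    """Gradient of f; piecewise linear, hence semismooth but not differentiable."""
    return A.T @ (A @ x - b) + mu * np.maximum(x, 0.0)

def generalized_hessian(x, A, mu):
    """One element of the generalized Jacobian of the gradient at x."""
    return A.T @ A + mu * np.diag((x > 0.0).astype(float))

def lazy_regularized_newton(A, b, mu, x0, m=5, tol=1e-8, max_iter=200):
    """Regularized Newton iteration that refreshes the Hessian every m steps.

    Returns (x, iterations_used, hessian_evaluations). Illustrative only.
    """
    x = np.asarray(x0, dtype=float).copy()
    identity = np.eye(x.size)
    M = 1.0          # assumed adaptive constant driving the regularization
    H = None         # current (possibly stale) generalized Hessian
    n_hess = 0
    for k in range(max_iter):
        g = gradient(x, A, b, mu)
        g_norm = np.linalg.norm(g)
        if g_norm <= tol:
            return x, k, n_hess
        if k % m == 0:                      # lazy update: refresh only periodically
            H = generalized_hessian(x, A, mu)
            n_hess += 1
        fx = objective(x, A, b, mu)
        slack = 1e-12 * (1.0 + abs(fx))     # guards against floating-point noise
        for _ in range(60):                 # backtracking on M (capped for safety)
            lam = np.sqrt(M * g_norm)       # gradient-dependent regularization
            d = np.linalg.solve(H + lam * identity, -g)
            if objective(x + d, A, b, mu) <= fx - 0.5 * lam * (d @ d) + slack:
                break                       # sufficient decrease: accept the step
            M *= 2.0                        # otherwise regularize more strongly
        else:
            raise RuntimeError("backtracking did not find an acceptable step")
        x = x + d
        M = max(0.5 * M, 1e-10)             # relax the regularization again
    return x, max_iter, n_hess
```

A possible call is `lazy_regularized_newton(A, b, mu=10.0, x0=np.ones(20), m=5)` with a random `40 x 20` matrix `A`; setting `m=1` recovers a Hessian evaluation at every iteration, which makes the saving from laziness easy to compare on this toy problem.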

Paper Structure (45 sections, 31 theorems, 112 equations, 6 figures, 2 tables)

Introduction
Related works.
Contributions.
Notation.
Semismooth Newton Method
Global convergence rates
Non-convex problems
Non-convex problems under PL condition
Convex problems
Superlinear convergence
Technical preliminaries
Superlinear convergence under semismoothness
Faster rate under $\gamma$-order semismoothness
Faster rates in the finite-dimensional $C^2$ setting
Discussion
...and 30 more sections

Key Result

Lemma 3.0

Let Assumption \ref{ass:basic} hold and consider iteration $k \geq 0$ of Algorithm \ref{alg:proximal_newton} applied to \eqref{eq:opt}. If $g_k>0$ and $\lambda \geq L$, both inequalities in \eqref{eq:stop_cond_alg} hold, i.e., the acceptance conditions of the inner loop of Algorithm \ref{alg:proximal_newton} hold and the inner loop ends after a finite number of t

Figures (6)

Figure 1: Training neural networks with Lipschitz constraints (Section \ref{sec:numerics_NN}). Adam is almost invisible on the left figure as it finishes 10,000 epochs very quickly, but GLAd-SSN performs an order of magnitude better in iteration count. Laziness helps significantly!
Figure 2: Estimated Lipschitz constant of the learned neural networks (Section \ref{sec:numerics_NN}).
Figure 3: Violation of the non-negativity constraint in the matrix factorization example (Section \ref{sec:numericsNMF}); the smaller the better.
Figure 4: Non-negative matrix factorization (Section \ref{sec:numericsNMF}): the gradient method finishes fast but performs poorly, LeAP-SSN does well initially but gets overtaken by GLAd-SSN.
Figure 5: Support vector machine classification (Section \ref{sec:numerics_SVM}): comparable behavior between GLAd-SSN and LeAP-SSN, and both perform much better than the gradient method.
...and 1 more figure

Theorems & Definitions (51)

Lemma 3.0
Theorem 3.1
Theorem 3.2: Global linear convergence
Theorem 3.3
Proposition 4.0
Theorem 4.1: Superlinear convergence under semismoothness
Lemma 4.1
Theorem 4.2: Main theorem, semismoothness
Theorem 4.3: Superlinear convergence under $\gamma$-order semismoothness, $\lzsymb=1$
Remark 4.4
...and 41 more
