Sketch-and-Project Meets Newton Method: Global $\mathcal O(k^{-2})$ Convergence with Low-Rank Updates

Slavomír Hanzely

TL;DR

This work addresses scalable second-order optimization for convex, self-concordant objectives by introducing SGN, a Sketchy Global Newton method that operates in random low-rank subspaces. SGN unifies sketch-and-project, subspace Newton, and subspace regularized Newton updates, delivering a global $\mathcal O(k^{-2})$ convergence rate while keeping per-iteration costs at $\mathcal O(d\tau^2)$ (and $\mathcal O(1)$ when $\tau=1$). It also achieves fast local linear convergence independent of the conditioning, and global linear convergence under relative smoothness and relative convexity, all under affine-invariant geometric assumptions. Empirical results on LIBSVM logistic-loss problems are consistent with the theory: SGN matches or approaches the performance of state-of-the-art Newton-like methods with substantially cheaper updates, which makes it practical for large-scale machine learning.

Abstract

In this paper, we propose the first sketch-and-project Newton method with a fast $\mathcal O(k^{-2})$ global convergence rate for self-concordant functions. Our method, SGN, can be viewed in three ways: i) as a sketch-and-project algorithm projecting updates of the Newton method, ii) as a cubically regularized Newton method in sketched subspaces, and iii) as a damped Newton method in sketched subspaces. SGN inherits the best of all three worlds: the cheap iteration costs of sketch-and-project methods, the state-of-the-art $\mathcal O(k^{-2})$ global convergence rate of full-rank Newton-like methods, and the algorithmic simplicity of damped Newton methods. Finally, we demonstrate that its empirical performance is comparable to that of baseline algorithms.
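To make the third view (a damped Newton method in sketched subspaces) concrete, here is a minimal NumPy sketch on an L2-regularized logistic loss. It is an illustration under stated assumptions, not the paper's reference implementation: the function names, the coordinate sketches, the regularizer `lam`, and in particular the damping rule for `alpha` (a closed form borrowed from the damped-Newton literature, with a hypothetical smoothness estimate `L_est`) are all assumptions that do not appear on this page.

```python
import numpy as np

def logistic_loss(x, A, b, lam):
    # Mean logistic loss with labels b in {-1, +1}, plus a small L2 term.
    z = -b * (A @ x)
    return np.mean(np.logaddexp(0.0, z)) + 0.5 * lam * (x @ x)

def sketched_grad_hess(x, A, b, lam, S):
    # Returns S^T grad f(x) and S^T Hess f(x) S without forming the d x d Hessian.
    n = A.shape[0]
    z = -b * (A @ x)
    sig = 0.5 * (1.0 + np.tanh(0.5 * z))      # numerically stable sigmoid(z)
    AS = A @ S                                # n x tau sketched data
    g_S = AS.T @ (-b * sig) / n + lam * (S.T @ x)
    w = sig * (1.0 - sig)                     # per-sample curvature weights
    H_S = (AS * w[:, None]).T @ AS / n + lam * (S.T @ S)
    return g_S, H_S

def sketched_damped_newton(A, b, lam=1e-2, tau=2, L_est=1.0, iters=300, seed=0):
    rng = np.random.default_rng(seed)
    d = A.shape[1]
    x = np.zeros(d)
    losses = [logistic_loss(x, A, b, lam)]
    for _ in range(iters):
        # Assumed sketch distribution: tau distinct coordinate directions.
        idx = rng.choice(d, size=tau, replace=False)
        S = np.zeros((d, tau))
        S[idx, np.arange(tau)] = 1.0
        g_S, H_S = sketched_grad_hess(x, A, b, lam, S)
        step = np.linalg.solve(H_S, g_S)      # Newton direction inside Range(S)
        dual_norm = np.sqrt(max(g_S @ step, 0.0))
        if dual_norm > 1e-12:
            # Assumed damping rule; alpha lies in (0, 1] and tends to 1 near a solution.
            u = L_est * dual_norm
            alpha = (-1.0 + np.sqrt(1.0 + 2.0 * u)) / u
            x = x - alpha * (S @ step)
        losses.append(logistic_loss(x, A, b, lam))
    return x, losses
```

Each iteration solves only a `tau x tau` linear system. This naive version still recomputes `A @ x` from scratch, so it does not attain the per-iteration cost quoted in the TL;DR; an efficient implementation would maintain that product incrementally.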

Paper Structure (37 sections, 21 theorems, 86 equations, 3 figures, 4 tables, 5 algorithms)

Introduction
Demands of modern machine learning
Contributions
Objective
Affine-invariant geometry
Algorithm
Three faces of the algorithm
Geometry of sketches
Affine-invariant assumptions
One step decrease
Main convergence results
Global convex $\mathcal{O} \left( k^{-2} \right)$ convergence
Fast local linear convergence
Global linear convergence
Experiments
...and 22 more sections

Key Result

Theorem 1

If $\nabla f(x^k) \in {\rm Range}\left( \nabla^2 f(x^k)\right)$ (here ${\rm Range}\left( \mathcal{A}\right)$ denotes the column space of the matrix $\mathcal{A}$), then the update rules are equivalent, where $\mathbf P_{x^{k}}$ is a projection matrix onto ${\rm Range}\left( \mathbf S_k\right)$ with respect to the norm ${\left \| \cdot \right\|}_{x^k}$ (defined in eq. eq:px). We call this algorithm Sketchy Glob…
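The equivalence between a Newton step taken inside the sketched subspace and a projected full Newton step can be checked numerically. The sketch below assumes that the projection takes the standard form of a projector that is orthogonal in the Hessian-induced inner product, $\mathbf P = \mathbf S (\mathbf S^\top \mathbf H \mathbf S)^{-1} \mathbf S^\top \mathbf H$; this explicit formula is an assumption, since the page does not reproduce it. The matrices `H`, `S` and the vector `g` are random stand-ins for the Hessian, the sketch, and the gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
d, tau = 6, 2

M = rng.standard_normal((d, d))
H = M @ M.T + np.eye(d)            # symmetric positive definite stand-in for the Hessian
g = rng.standard_normal(d)         # stand-in for the gradient
S = rng.standard_normal((d, tau))  # sketch matrix with full column rank (almost surely)

# Newton direction computed entirely inside the sketched subspace.
subspace_dir = S @ np.linalg.solve(S.T @ H @ S, S.T @ g)

# Assumed projector onto Range(S), orthogonal with respect to <u, v>_H = u^T H v.
P = S @ np.linalg.solve(S.T @ H @ S, S.T @ H)

# Full Newton direction, then projected onto Range(S).
projected_dir = P @ np.linalg.solve(H, g)

print(np.allclose(subspace_dir, projected_dir))
```

The two directions coincide because $\mathbf P \mathbf H^{-1} g = \mathbf S (\mathbf S^\top \mathbf H \mathbf S)^{-1} \mathbf S^\top g$, so the full Hessian never has to be inverted; only the `tau x tau` system is solved.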

Figures (3)

Figure 1: Comparison of SSCN, SGN and CD on the logistic regression loss on LIBSVM datasets for sketch matrices $\mathbf S$ of rank one. We fine-tune all algorithms for their smoothness parameters.
Figure 2: Comparison of SSCN, SGN, CD and ACD on logistic regression on LIBSVM datasets for sketch matrices $\mathbf S$ of rank one. We fine-tune all algorithms for smoothness parameters.
Figure 3: Exact Newton Descent [KSJ-Newton2018]

Theorems & Definitions (38)

Theorem 1
Lemma 1
Lemma 2
Lemma 3
Definition 1
Definition 2
Lemma 4
Proposition 1 ([hanzely2022damped], Lemma 2)
Lemma 5
Theorem 2
...and 28 more
