Unified Convergence Theory of Stochastic and Variance-Reduced Cubic Newton Methods

El Mahdi Chayti; Nikita Doikov; Martin Jaggi

TL;DR

The paper introduces a helper framework that unifies stochastic and variance-reduced Cubic Newton methods for non-convex optimization. The framework allows arbitrary batch sizes, noisy and possibly biased gradient and Hessian estimates, and lazy Hessian updates. It shows how to construct gradient and Hessian estimators from cheap helper functions and snapshots, and derives global convergence guarantees for general non-convex objectives and for gradient-dominated classes. The authors present new algorithms, including a lazy stochastic second-order method and variance-reduced variants, and show improved arithmetic complexity for high-dimensional problems, along with applications to auxiliary learning, core sets, and semi-supervised settings. Experiments are consistent with the theory, showing time and computation savings without sacrificing convergence. The result is a flexible framework for efficient second-order optimization in large-scale non-convex problems.
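As a rough illustration of how a gradient estimator can be built from a cheap helper and a snapshot, the sketch below uses an SVRG-style control variate: the helper gradient at the current point, corrected by the full gradient computed once at the snapshot. The names `helper_gradient`, `grad_h`, `grad_f`, and the toy least-squares data are assumptions made for this example; the paper's exact estimators may differ.

```python
import numpy as np

def helper_gradient(grad_h, grad_f, x, snapshot):
    """SVRG-style estimate of grad f(x): cheap helper gradient at x,
    corrected by the (expensive) full gradient evaluated at the snapshot."""
    return grad_h(x) - grad_h(snapshot) + grad_f(snapshot)

# Toy least-squares problem (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5))
b = rng.normal(size=200)
batch = rng.choice(200, size=20, replace=False)

def grad_f(x):
    # Full gradient of f(x) = ||A x - b||^2 / (2 n).
    return A.T @ (A @ x - b) / len(b)

def grad_h(x):
    # Helper: the same loss restricted to a small batch of rows.
    Ab, bb = A[batch], b[batch]
    return Ab.T @ (Ab @ x - bb) / len(bb)

snapshot = np.zeros(5)
x = snapshot + 0.01 * rng.normal(size=5)

# Error of the plain helper gradient vs. the snapshot-corrected estimate.
err_helper = np.linalg.norm(grad_h(x) - grad_f(x))
err_vr = np.linalg.norm(helper_gradient(grad_h, grad_f, x, snapshot) - grad_f(x))
```

In this toy setup the corrected estimate is exact at the snapshot, and its error grows only with the distance between `x` and the snapshot, which is why the snapshot has to be refreshed periodically.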

Abstract

We study stochastic Cubic Newton methods for solving general possibly non-convex minimization problems. We propose a new framework, which we call the helper framework, that provides a unified view of the stochastic and variance-reduced second-order algorithms equipped with global complexity guarantees. It can also be applied to learning with auxiliary information. Our helper framework offers the algorithm designer high flexibility for constructing and analyzing the stochastic Cubic Newton methods, allowing arbitrary size batches, and the use of noisy and possibly biased estimates of the gradients and Hessians, incorporating both the variance reduction and the lazy Hessian updates. We recover the best-known complexities for the stochastic and variance-reduced Cubic Newton, under weak assumptions on the noise. A direct consequence of our theory is the new lazy stochastic second-order method, which significantly improves the arithmetic complexity for large dimension problems. We also establish complexity bounds for the classes of gradient-dominated objectives, that include convex and strongly convex problems. For Auxiliary Learning, we show that using a helper (auxiliary function) can outperform training alone if a given similarity measure is small.
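To make the lazy Hessian idea concrete, here is a minimal sketch of a Cubic Newton loop that recomputes the Hessian only every `m` iterations and reuses the stored snapshot in between. It assumes the standard cubic model $\langle g, h \rangle + \frac{1}{2}\langle H h, h \rangle + \frac{M}{6}\|h\|^3$ and solves it with a dense eigendecomposition plus bisection. The function names, the solver, and the parameter choices are assumptions for illustration, not the authors' implementation, and the degenerate "hard case" of the subproblem is not handled.

```python
import numpy as np

def cubic_step(g, H, M):
    """Approximately minimize g@h + 0.5*h@H@h + (M/6)*||h||^3 over h.

    The minimizer satisfies h = -(H + (M*r/2) I)^{-1} g with r = ||h||,
    so we search for r by bisection in the eigenbasis of H.
    """
    lam, Q = np.linalg.eigh(H)          # eigenvalues in ascending order
    gt = Q.T @ g

    def norm_h(r):
        return np.linalg.norm(gt / (lam + 0.5 * M * r))

    # Smallest r keeping H + (M*r/2) I positive definite, with a small margin.
    lo = max(0.0, -2.0 * lam[0] / M) * (1.0 + 1e-12) + 1e-12
    hi = max(2.0 * lo, 1.0)
    while norm_h(hi) > hi:              # grow the bracket until it contains the root
        hi *= 2.0
    for _ in range(100):                # bisection on norm_h(r) - r, which decreases in r
        mid = 0.5 * (lo + hi)
        if norm_h(mid) > mid:
            lo = mid
        else:
            hi = mid
    return -Q @ (gt / (lam + 0.5 * M * hi))

def lazy_cubic_newton(grad, hess, x0, M, m, iters):
    """Cubic Newton with a Hessian snapshot refreshed every m iterations."""
    x = np.asarray(x0, dtype=float)
    H = None
    for t in range(iters):
        if t % m == 0:
            H = hess(x)                 # the expensive call, made once per m steps
        x = x + cubic_step(grad(x), H, M)
    return x
```

With `m = 1` the loop reduces to a plain Cubic Newton iteration; larger `m` trades some per-step model accuracy for fewer Hessian evaluations, which matters most when the dimension is large.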

Paper Structure (39 sections, 1 theorem, 129 equations, 10 figures, 1 table, 1 algorithm)

Introduction
Notation and Assumptions
Computing gradients and Hessians.
Second-Order Optimization with Helper Functions
General principle.
Basic Stochastic Methods
Let the Objective Guide Us
Variance Reduction and Lazy Hessians
Choice of the parameter $m$ in Algorithm alg:2.
General variance reduction.
Variance reduction with Lazy Hessians.
Other Applications
Gradient Dominated Functions
Experiments
To Be Lazy or Not to Be
...and 24 more sections

Key Result

Corollary 3.2

In Algorithm alg:2, let us choose $M = L$ and $m = 1$, with basic helpers hBasic. Then, according to Theorem THSCN, for any $\varepsilon>0$, to reach an $(\varepsilon,L)$-approximate second-order local minimum, we need at most $S = \frac{\sqrt{L}F_0}{\varepsilon^{3/2}}$ iterations with $b_g = (\frac

Figures (10)

Figure 1: Comparison of the convergence of different algorithms. We see that "Lazy VR"
Figure 2: Cubic Newton method with and without the helper function $h$. For $m=1$, this is simply the classic Cubic Newton method. To give the plot an intuitive meaning, $\frac{1}{m}$ is the fraction of labeled data used during training. Our approach clearly benefits from the helper function $h$.
Figure 3: Comparison of the convergence of different algorithms. We see that using our approach, we benefit a lot from the helper function $h$.
Figure 4: Comparison of the convergence of the different algorithms. Except for gradient descent ("GD"), which performs very well in this case, the same conclusions as in Figure fig:2 can be drawn with respect to "Lazy VR".
Figure 5: Effect of increasing the dimension on the convergence of the optimization algorithms we consider. As the dimension increases, the gap between our method "Lazy VR" and "full VR" widens, meaning that "Lazy VR" saves more time as the dimension of the problem grows.
...and 5 more figures

Theorems & Definitions (6)

Remark 3.1
Corollary 3.2
Example 3.3
Example 3.4
Example F.1
Example F.2
