Complexity reduction in online stochastic Newton methods with potential O(N d) total cost

Antoine Godichon-Baggioni; Bruno Portier; Guillaume Sallé

TL;DR

The paper tackles scalable optimization of smooth convex functions in a stochastic online setting by introducing the masked Stochastic Newton Algorithm (mSNA), which leverages online mini-batch Hessian information with random column masking to reduce per-iteration cost. It proves almost sure convergence and asymptotic efficiency without iterate averaging, and demonstrates that the streaming/multi-batch variant can achieve a total complexity of $O(Nd)$ for a full data pass with $b=d$ and $\ell=1$. The algorithm maintains a positive definite inverse-Hessian estimator via a reduced-cost SGD on a quadratic functional, enabling effective second-order updates in high dimensions. Numerical experiments on synthetic and real data corroborate the theoretical guarantees and show competitive performance against first-order methods, confirming the practical viability of second-order online methods in large-scale streaming contexts.
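To make the idea of a reduced-cost SGD on a quadratic functional concrete, here is a minimal sketch, not the paper's mSNA update. It assumes the functional $J(A) = \tfrac{1}{2}\operatorname{tr}(A^\top H A) - \operatorname{tr}(A)$, whose gradient $HA - I$ vanishes at $A = H^{-1}$, and it replaces $HA$ with an unbiased estimate that reads only $\ell$ uniformly sampled columns of $H$. The function names, the step-size schedule, and the use of a fixed matrix `H` in place of noisy mini-batch Hessians are all illustrative assumptions. Unlike the estimator described above, this plain SGD does not enforce positive definiteness.

```python
import numpy as np


def masked_gradient(H, A, cols):
    """Estimate of grad J(A) = H A - I that reads only the columns `cols` of H.

    Hypothetical helper: the d / len(cols) rescaling makes the estimate
    unbiased when `cols` is sampled uniformly without replacement.
    """
    d = H.shape[0]
    scale = d / len(cols)
    # (d x ell) block of H times (ell x d) block of A: O(ell * d^2) operations.
    return scale * H[:, cols] @ A[cols, :] - np.eye(d)


def masked_inverse_sgd(H, n_steps=5000, ell=1, c=0.5, alpha=0.75, seed=0):
    """Illustrative masked SGD on J(A); step sizes c / (n + 100)^alpha are assumed."""
    rng = np.random.default_rng(seed)
    d = H.shape[0]
    A = np.eye(d)  # positive definite starting point
    for n in range(1, n_steps + 1):
        cols = rng.choice(d, size=ell, replace=False)  # random column mask
        gamma = c / (n + 100) ** alpha
        A = A - gamma * masked_gradient(H, A, cols)
    return A


if __name__ == "__main__":
    H = np.array([[2.0, 0.5, 0.0],
                  [0.5, 3.0, 0.2],
                  [0.0, 0.2, 1.5]])
    A_hat = masked_inverse_sgd(H)
    print("Frobenius error:", np.linalg.norm(A_hat - np.linalg.inv(H)))
```

With $\ell = 1$ each step touches one column of $H$ and one row of $A$, so the update is a rank-one outer product costing $O(d^2)$ rather than the $O(d^3)$ of a full matrix product.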

Abstract

Optimizing smooth convex functions in stochastic settings, where only noisy estimates of gradients and Hessians are available, is a fundamental problem in optimization. While first-order methods possess a low per-iteration cost, their convergence is slow for ill-conditioned problems. Stochastic Newton methods utilize second-order information to correct for local curvature, but the $O(d^3)$ per-iteration cost of computing and inverting a full Hessian, where $d$ is the problem dimension, is prohibitive in high dimensions. This paper introduces an online mini-batch stochastic Newton algorithm. The method employs a random masking strategy that selects a subset of Hessian columns at each iteration, substantially reducing the per-step computational cost. This approach allows the algorithm, in the mini-batch setting, to achieve a total computational cost for a single pass over $N$ data points of $O(Nd)$, which is comparable to first-order methods while retaining the advantages of second-order information. We establish the almost sure convergence and asymptotic efficiency of the resulting estimator. This property is obtained without requiring iterate averaging, which distinguishes this work from prior analyses.
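The $O(Nd)$ figure can be checked with a back-of-the-envelope count. The cost model below is an assumption, not taken from the paper: it reads $b$ as the mini-batch size and $\ell$ as the number of sampled Hessian columns, and supposes that a mini-batch gradient costs $O(bd)$, that $\ell$ mini-batch Hessian columns cost $O(bd\ell)$ (as in a generalized linear model), that the masked inverse update costs $O(\ell d^2)$, and that the Newton step costs $O(d^2)$.

```latex
\begin{align*}
C_{\text{iter}} &= O(bd) + O(bd\ell) + O(\ell d^2) + O(d^2)
  && \text{(assumed per-iteration cost)} \\
C_{\text{total}} &= \frac{N}{b}\, C_{\text{iter}}
  && \text{(one pass: } N/b \text{ iterations)} \\
b = d,\ \ell = 1:\quad
C_{\text{total}} &= \frac{N}{d}\, O(d^2) = O(Nd).
\end{align*}
```

Under the same assumptions, a full $d \times d$ Hessian inversion at every iteration would cost $\frac{N}{b}\, O(d^3)$, which is $O(Nd^2)$ for $b = d$.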

Paper Structure (43 sections, 11 theorems, 89 equations, 2 figures, 2 tables)

Introduction
Related works.
Contributions.
Paper Organization.
Framework
Notations.
Problem setting
In practice
Online Estimation of the Inverse of a Positive Definite Matrix
Stochastic Gradient Estimation of the Inverse
A Reduced-Cost Positive Estimator of the Inverse
Convergence Results
Relation to Previous Works
Remark on a Weighted Averaged Version
Algorithms
...and 28 more sections

Key Result

Proposition 3.1

Let $(\hat{\theta}_n)_{n \geq 0}$ be a sequence of estimators of $\theta^*$ adapted to the filtration $(\mathcal{F}_n)_{n \geq 0}$, and let $(A_n)_{n \geq 0}$ be the sequence defined by (eq:def A_n) using $H_n := h_n(\hat{\theta}_{n-1})$. Suppose Assumptions Assumption oracles unbiased, Assumpti…

Figures (2)

Figure 1: Linear regression with $d=1000$. From left to right: evolution of the quadratic errors $\| \theta_{n} - \theta^{*} \|^{2}$, evolution of the quadratic errors $\| A_{n} - H^{-1} \|_{F}^{2}$, and computational time for the different methods.
Figure 2: Logistic regression with $d=1000$. From left to right: evolution of the quadratic errors $\| \theta_{n} - \theta^{*} \|^{2}$, evolution of the quadratic errors $\| A_{n} - H^{-1} \|_{F}^{2}$, and computational time for the different methods.

Theorems & Definitions (19)

Proposition 3.1
Theorem 4.1
Corollary 4.2
Theorem 4.3
Remark 4.4
Theorem 4.5
Corollary 5.1
Proof of Proposition (prop::cv rate of A_n)
Lemma A.1
Proof of Theorem (thm::USNA)
...and 9 more
