Complexity reduction in online stochastic Newton methods with potential O(Nd) total cost
Antoine Godichon-Baggioni, Bruno Portier, Guillaume Sallé
TL;DR
The paper tackles scalable optimization of smooth convex functions in a stochastic online setting by introducing the masked Stochastic Newton Algorithm (mSNA), which leverages online mini-batch Hessian information with random column masking to reduce per-iteration cost. It proves almost sure convergence and asymptotic efficiency without iterate averaging, and demonstrates that the streaming/multi-batch variant can achieve a total complexity of $O(Nd)$ for a full data pass with $b=d$ and $\ell=1$. The algorithm maintains a positive definite inverse-Hessian estimator via a reduced-cost SGD on a quadratic functional, enabling effective second-order updates in high dimensions. Numerical experiments on synthetic and real data corroborate the theoretical guarantees and show competitive performance against first-order methods, confirming the practical viability of second-order online methods in large-scale streaming contexts.
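The $O(Nd)$ figure can be checked with a short count. This is a hedged sketch, not the paper's own accounting: it reads $b$ as the mini-batch size and $\ell$ as the number of Hessian columns kept by the mask, and it assumes a per-iteration cost of $O(bd\ell + d^2)$ (batch gradient and masked Hessian columns, plus a matrix-vector update). None of these readings is confirmed by the summary above.

```latex
% Assumed: N/b iterations per data pass, each costing O(b d \ell + d^2).
\[
\underbrace{\frac{N}{b}}_{\text{iterations}}
\times
\underbrace{O\!\left(b\,d\,\ell + d^{2}\right)}_{\text{assumed cost per iteration}}
\;\overset{b=d,\ \ell=1}{=}\;
\frac{N}{d}\times O\!\left(d^{2}\right)
= O(Nd).
\]
% For comparison, inverting a full d x d Hessian at every iteration
% with the same batch size would give
\[
\frac{N}{d}\times O\!\left(d^{3}\right) = O\!\left(Nd^{2}\right).
\]
```

Under these assumptions, the choice $b=d$ balances the two terms: the batch work and the $d^2$ update each cost $O(d^2)$ per iteration, and there are only $N/d$ iterations.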
Abstract
Optimizing smooth convex functions in stochastic settings, where only noisy estimates of gradients and Hessians are available, is a fundamental problem in optimization. While first-order methods possess a low per-iteration cost, their convergence is slow for ill-conditioned problems. Stochastic Newton methods utilize second-order information to correct for local curvature, but the $O(d^3)$ per-iteration cost of computing and inverting a full Hessian, where $d$ is the problem dimension, is prohibitive in high dimensions. This paper introduces an online mini-batch stochastic Newton algorithm. The method employs a random masking strategy that selects a subset of Hessian columns at each iteration, substantially reducing the per-step computational cost. This approach allows the algorithm, in the mini-batch setting, to achieve a total computational cost for a single pass over $N$ data points of $O(Nd)$, which is comparable to first-order methods while retaining the advantages of second-order information. We establish the almost sure convergence and asymptotic efficiency of the resulting estimator. This property is obtained without requiring iterate averaging, which distinguishes this work from prior analyses.
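To make the column-masking idea concrete, here is a minimal sketch of computing only a random subset of columns of a mini-batch Hessian. It is an illustration under stated assumptions, not the paper's algorithm: it uses a least-squares loss (whose mini-batch Hessian is $X^\top X / b$), and the function name `masked_hessian_columns` and the variables `b`, `ell`, `cols` are hypothetical. It shows only the masking step, not the inverse-Hessian update.

```python
import numpy as np

def masked_hessian_columns(X, cols):
    """Return the selected columns of the mini-batch Hessian (1/b) X^T X.

    Assumption: least-squares loss, so the Hessian does not depend on the
    parameter. Cost is O(b * d * len(cols)) instead of O(b * d^2).
    """
    b = X.shape[0]
    return X.T @ X[:, cols] / b

rng = np.random.default_rng(0)
d, b, ell = 50, 50, 1              # dimension, batch size, number of kept columns
X = rng.standard_normal((b, d))    # one synthetic mini-batch

# Random mask: pick ell distinct column indices uniformly at random.
cols = rng.choice(d, size=ell, replace=False)

H_cols = masked_hessian_columns(X, cols)   # shape (d, ell)
H_full = X.T @ X / b                       # full Hessian, for comparison only
```

The masked columns agree exactly with the corresponding columns of the full mini-batch Hessian; the saving comes from never forming the other $d - \ell$ columns.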
