On Adaptive Stochastic Optimization for Streaming Data: A Newton's Method with O(dN) Operations

Antoine Godichon-Baggioni; Nicklas Werge

TL;DR

This paper introduces adaptive stochastic optimization methods that handle ill-conditioned problems while operating in a streaming context, and presents an adaptive inversion-free Newton's method whose computational complexity matches that of first-order methods.

Abstract

Stochastic optimization methods encounter new challenges in the realm of streaming, characterized by a continuous flow of large, high-dimensional data. While first-order methods, like stochastic gradient descent, are the natural choice, they often struggle with ill-conditioned problems. In contrast, second-order methods, such as Newton's methods, offer a potential solution, but their computational demands render them impractical. This paper introduces adaptive stochastic optimization methods that bridge the gap between addressing ill-conditioned problems and functioning in a streaming context. Notably, we present an adaptive inversion-free Newton's method with a computational complexity matching that of first-order methods, $\mathcal{O}(dN)$, where $d$ represents the number of dimensions/features, and $N$ the number of data points. Theoretical analysis confirms the asymptotic efficiency of these methods, and empirical evidence demonstrates their effectiveness, especially in scenarios involving complex covariance structures and challenging initializations. In particular, our adaptive Newton's methods outperform existing methods, while maintaining favorable computational efficiency.
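The term "inversion-free" can be illustrated on least squares with the Sherman-Morrison rank-one update, which maintains the inverse of a running Hessian sum without ever solving a linear system. The sketch below is an illustration under assumptions, not the authors' algorithm: the function names, the `ridge` initialization, and the one-update-per-sample schedule are hypothetical choices, and this version costs $\mathcal{O}(d^{2})$ per sample rather than the $\mathcal{O}(d)$ per sample implied by the $\mathcal{O}(dN)$ total in the abstract.

```python
import numpy as np


def sherman_morrison_update(S_inv, x):
    """Return (S + x x^T)^{-1} from S^{-1} in O(d^2) operations.

    Assumes S_inv is symmetric, so (S^{-1} x)(x^T S^{-1}) is outer(Sx, Sx).
    """
    Sx = S_inv @ x
    return S_inv - np.outer(Sx, Sx) / (1.0 + x @ Sx)


def streaming_newton_least_squares(X, y, theta0, ridge=1.0):
    """One pass over (x_t, y_t) pairs with a Newton-type step per sample.

    S_inv tracks the inverse of ridge * I + sum_t x_t x_t^T. Writing the
    averaged Hessian as S_t / t and the step size as 1 / t, the Newton step
    (1 / t) * (S_t / t)^{-1} * grad simplifies to S_t^{-1} * grad.
    `ridge` is a hypothetical regularizer used only to make S invertible.
    """
    d = X.shape[1]
    theta = np.asarray(theta0, dtype=float).copy()
    S_inv = np.eye(d) / ridge
    for x_t, y_t in zip(X, y):
        S_inv = sherman_morrison_update(S_inv, x_t)
        grad = (x_t @ theta - y_t) * x_t  # gradient of 0.5 * (x^T theta - y)^2
        theta = theta - S_inv @ grad
    return theta
```

For the quadratic loss this recursion coincides with recursive least squares, so after one pass the iterate equals the ridge solution anchored at `theta0`. That makes it easy to check against a closed form, but it does not carry over to other losses such as logistic regression.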

Paper Structure (38 sections, 15 theorems, 185 equations, 2 figures)

Introduction
Contributions.
Related work.
Organization.
Notations.
Underlying Theoretical Framework
Adaptive Stochastic Optimization Methods
The Weighted Averaged Version
Applications to Newton's Method
Direct Streaming Stochastic Newton's Method
Streaming Stochastic Newton's methods with possibly $\mathcal{O}(dN_{t})$ operations
Weighted Averaged Version of Streaming Stochastic Newton's methods with possibly $\mathcal{O}(dN_{t})$ operations
Experiments
Least-Squares Regression
Logistic Regression
...and 23 more sections

Key Result

Theorem 1

Suppose Assumptions 1, 2, and 3 hold, along with the conditions on the step sequence. Then, $\theta_{t}$ converges almost surely to $\theta^{*}$.

Figures (2)

Figure 1: Least-Squares Regression: Mean-squared error of the distance to the optimum $\theta^{*}$, plotted against the sample size (up to $1{,}000{,}000$), for various initializations. The initial points $\theta_{0}$ are generated as $\theta_{0}=\theta^{*}(1+rU)$, where $U$ is a uniform random variable on the unit sphere of $\mathbb{R}^{d}$, and $r$ takes values of $1$ (left) or $5$ (right). Each curve reports $\lVert\theta_{t}-\theta^{*}\rVert$ averaged over $50$ different epochs, with a different initial point drawn for each sample.
Figure 2: Logistic Regression: Mean-squared error of the distance to the optimum $\theta^{*}$, plotted against the sample size (up to $1{,}000{,}000$), for various initializations. The initial points $\theta_{0}$ are generated as $\theta_{0}=\theta^{*}(1+rU)$, where $U$ is a uniform random variable on the unit sphere of $\mathbb{R}^{d}$, and $r$ takes values of $1$ (left) or $5$ (right). Each curve reports $\lVert\theta_{t}-\theta^{*}\rVert$ averaged over $50$ different epochs, with a different initial point drawn for each sample.
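
The initialization used in both figures can be sketched as follows. This is a minimal illustration under two assumptions that the captions do not settle: the product $\theta^{*}(1+rU)$ is read componentwise, and $U$ is drawn by normalizing a standard Gaussian vector, a standard way to sample uniformly on the unit sphere. The function names are hypothetical.

```python
import numpy as np


def sample_unit_sphere(d, rng):
    """Draw a point uniformly on the unit sphere of R^d.

    A standard Gaussian vector is rotation invariant, so its normalized
    version is uniformly distributed on the sphere.
    """
    g = rng.standard_normal(d)
    return g / np.linalg.norm(g)


def draw_initial_point(theta_star, r, rng):
    """Return theta_0 = theta_star * (1 + r * U), read componentwise (assumption)."""
    U = sample_unit_sphere(theta_star.shape[0], rng)
    return theta_star * (1.0 + r * U)


# Usage: one initial point per value of r, as in the left and right panels.
rng = np.random.default_rng(0)
theta_star = np.linspace(-1.0, 1.0, 10)
theta0_near = draw_initial_point(theta_star, r=1, rng=rng)
theta0_far = draw_initial_point(theta_star, r=5, rng=rng)
```

Under this reading, a larger $r$ scales the perturbation of each coordinate in proportion to the size of that coordinate of $\theta^{*}$, so $r=5$ gives the harder starting point.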

Theorems & Definitions (15)

Theorem 1
Theorem 2
Theorem 3
Theorem 4
Theorem 5
Theorem 6
Theorem 7
Theorem A.1
Theorem A.2
Theorem A.3
...and 5 more
