On Adaptive Stochastic Optimization for Streaming Data: A Newton's Method with O(dN) Operations

Antoine Godichon-Baggioni, Nicklas Werge

TL;DR

This paper introduces adaptive stochastic optimization methods that handle ill-conditioned problems while operating in a streaming context, and presents an adaptive inversion-free Newton's method whose computational complexity matches that of first-order methods.

Abstract

Stochastic optimization methods encounter new challenges in the realm of streaming, characterized by a continuous flow of large, high-dimensional data. While first-order methods, like stochastic gradient descent, are the natural choice, they often struggle with ill-conditioned problems. In contrast, second-order methods, such as Newton's methods, offer a potential solution, but their computational demands render them impractical. This paper introduces adaptive stochastic optimization methods that bridge the gap between addressing ill-conditioned problems and functioning in a streaming context. Notably, we present an adaptive inversion-free Newton's method with a computational complexity matching that of first-order methods, $\mathcal{O}(dN)$, where $d$ represents the number of dimensions/features, and $N$ the number of data points. Theoretical analysis confirms the asymptotic efficiency of these methods, and empirical evidence demonstrates their effectiveness, especially in scenarios involving complex covariance structures and challenging initializations. In particular, our adaptive Newton's methods outperform existing methods, while maintaining favorable computational efficiency.
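To make "inversion-free" concrete, here is a minimal sketch of a classical streaming Newton-type update for least squares that maintains an inverse Hessian estimate through the Sherman-Morrison rank-one formula. This is not the paper's algorithm: it costs $\mathcal{O}(d^{2})$ per sample rather than the $\mathcal{O}(d)$ per sample needed for an $\mathcal{O}(dN)$ total, and it coincides with recursive least squares. The function names `sherman_morrison_update` and `streaming_newton_least_squares`, and the choice $S_{0}=I$, are assumptions made for illustration.

```python
import numpy as np


def sherman_morrison_update(H_inv, x):
    """Return (H + x x^T)^{-1} from a symmetric H^{-1}, without inverting any matrix."""
    Hx = H_inv @ x
    # Rank-one correction; valid because H_inv is symmetric, so x^T H_inv = Hx^T.
    return H_inv - np.outer(Hx, Hx) / (1.0 + x @ Hx)


def streaming_newton_least_squares(X, y, theta0):
    """One pass over (X, y), one sample at a time (illustrative, O(d^2) per step)."""
    d = X.shape[1]
    theta = np.asarray(theta0, dtype=float).copy()
    S_inv = np.eye(d)  # inverse of S_0 = I (assumed initialization)
    for x, target in zip(X, y):
        # Gradient of the per-sample loss 0.5 * (x^T theta - target)^2.
        grad = (x @ theta - target) * x
        # S_t = S_{t-1} + x x^T accumulates the per-sample Hessians.
        S_inv = sherman_morrison_update(S_inv, x)
        # Newton-type step preconditioned by the running inverse Hessian sum.
        theta = theta - S_inv @ grad
    return theta
```

On an ill-conditioned design (for example, feature scales spread over an order of magnitude), one pass of this update lands close to the generating parameter, because the preconditioner rescales each direction by its curvature.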

Paper Structure (38 sections, 15 theorems, 185 equations, 2 figures)

This paper contains 38 sections, 15 theorems, 185 equations, 2 figures.

Key Result

Theorem 1

Suppose Assumptions 1, 2, and 3 hold, along with the step conditions (cond::step). Then, $\theta_{t}$ converges almost surely to $\theta^{*}$.

Figures (2)

  • Figure 1: Least-Squares Regression: Mean-squared error of the distance to the optimum $\theta^{*}$, plotted against the sample size (up to $1{,}000{,}000$), for various initializations. The initial points $\theta_{0}$ are generated as $\theta_{0}=\theta^{*}(1+rU)$, where $U$ is a uniform random variable on the unit sphere of $\mathbb{R}^{d}$, and $r$ takes values of $1$ (left) or $5$ (right). Each curve reports $\lVert\theta_{t}-\theta^{*}\rVert$ averaged over $50$ different epochs, with a different initial point drawn for each sample.
  • Figure 2: Logistic Regression: Mean-squared error of the distance to the optimum $\theta^{*}$, plotted against the sample size (up to $1{,}000{,}000$), for various initializations. The initial points $\theta_{0}$ are generated as $\theta_{0}=\theta^{*}(1+rU)$, where $U$ is a uniform random variable on the unit sphere of $\mathbb{R}^{d}$, and $r$ takes values of $1$ (left) or $5$ (right). Each curve reports $\lVert\theta_{t}-\theta^{*}\rVert$ averaged over $50$ different epochs, with a different initial point drawn for each sample.
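
The initialization described in both captions can be sketched as follows. The captions do not say how the product $\theta^{*}(1+rU)$ is formed when $U$ is a vector, so this sketch assumes a coordinate-wise product; that reading, and the function names `sample_unit_sphere` and `perturbed_init`, are assumptions rather than details taken from the paper. Drawing $U$ by normalizing a standard Gaussian vector is a standard way to sample uniformly from the unit sphere.

```python
import numpy as np


def sample_unit_sphere(d, rng):
    """Draw a point uniformly on the unit sphere of R^d by normalizing a Gaussian vector."""
    g = rng.standard_normal(d)
    return g / np.linalg.norm(g)


def perturbed_init(theta_star, r, rng):
    """Return theta_0 = theta_star * (1 + r * U), read coordinate-wise (assumed)."""
    u = sample_unit_sphere(theta_star.shape[0], rng)
    return theta_star * (1.0 + r * u)
```

Under this reading, $r$ controls the relative distance of the starting point from the optimum: $r=0$ returns $\theta^{*}$ itself, and larger $r$ gives a more challenging initialization.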

Theorems & Definitions (15)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • Theorem A.1
  • Theorem A.2
  • Theorem A.3
  • ...and 5 more