A Full Adagrad algorithm with O(Nd) operations

Antoine Godichon-Baggioni; Wei Lu; Bruno Portier

TL;DR

This work tackles the computational bottleneck of full-matrix AdaGrad by introducing a Robbins-Monro online estimator for $\Sigma^{-1/2}$, the inverse square root of the gradient covariance. This enables full-matrix preconditioning with $\mathcal{O}(td^{2})$ total operations over $t$ iterations, and a streaming variant requiring $\mathcal{O}(N_t d)$ operations. The proposed WAFA scheme combines averaged iterates with a positive-definite, data-adapted preconditioner to obtain asymptotic efficiency, including a central limit theorem for the averaged estimator. Theoretical results establish strong consistency, convergence rates, and asymptotic normality under moment, Lipschitz, and Hessian conditions. Simulations and real-data experiments indicate that full-matrix preconditioning improves on diagonal AdaGrad when gradient coordinates are correlated, and that the streaming variants reduce computation with little loss in accuracy. Overall, the paper provides a theoretically grounded, scalable approach to Full AdaGrad for large-scale, high-dimensional stochastic optimization.
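To illustrate why correlation between gradient coordinates matters, the sketch below compares the covariance of a preconditioned gradient under a diagonal scaling and under the full $\Sigma^{-1/2}$. It is not taken from the paper: the matrix `sigma` and the helper `inv_sqrt` are hypothetical, and the exact inverse square root is computed by eigendecomposition purely for illustration.

```python
import numpy as np

def inv_sqrt(sigma):
    """Exact Sigma^{-1/2} of a symmetric positive-definite matrix (illustration only)."""
    w, u = np.linalg.eigh(sigma)
    return (u * w ** -0.5) @ u.T  # U diag(w^{-1/2}) U^T

# Hypothetical gradient covariance: unequal scales (1 and 3) and correlation 0.9.
sigma = np.array([[1.0, 2.7],
                  [2.7, 9.0]])

full = inv_sqrt(sigma)                   # full-matrix preconditioner
diag = np.diag(np.diag(sigma) ** -0.5)   # diagonal (AdaGrad-style) preconditioner

# Covariance of the preconditioned gradient P g is P Sigma P^T.
cond_raw = np.linalg.cond(sigma)
cond_diag = np.linalg.cond(diag @ sigma @ diag)
cond_full = np.linalg.cond(full @ sigma @ full)

print(f"condition number, no preconditioning: {cond_raw:.1f}")
print(f"condition number, diagonal scaling:   {cond_diag:.1f}")
print(f"condition number, full Sigma^(-1/2):  {cond_full:.1f}")
```

In this toy case the diagonal scaling removes the difference in scale but leaves the correlation matrix, whose condition number is $(1+0.9)/(1-0.9)=19$, while the full preconditioner whitens the gradient completely.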

Abstract

A novel approach is given to overcome the computational challenges of the full-matrix Adaptive Gradient algorithm (Full AdaGrad) in stochastic optimization. By developing a recursive method that estimates the inverse of the square root of the covariance of the gradient, alongside a streaming variant for parameter updates, the study offers efficient and practical algorithms for large-scale applications. This innovative strategy significantly reduces the complexity and resource demands typically associated with full-matrix methods, enabling more effective optimization processes. Moreover, the convergence rates of the proposed estimators and their asymptotic efficiency are given. Their effectiveness is demonstrated through numerical studies.
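As a rough illustration of the recursive idea, the sketch below uses the fact that $A=\Sigma^{-1/2}$ is the symmetric positive-definite solution of $A\Sigma A = I_d$ and runs a Robbins-Monro recursion on that equation, at $\mathcal{O}(d^{2})$ cost per gradient. This is a minimal sketch under stated assumptions, not the paper's exact scheme: the step-size constants `c`, `t0`, `alpha`, the step safeguard, and the simulated Gaussian gradients are all choices made here for illustration.

```python
import numpy as np

def inv_sqrt(sigma):
    """Exact Sigma^{-1/2} via eigendecomposition; used only as a reference value."""
    w, u = np.linalg.eigh(sigma)
    return (u * w ** -0.5) @ u.T

def robbins_monro_inv_sqrt(grads, c=0.5, t0=10.0, alpha=0.75):
    """Recursive estimate of Sigma^{-1/2}, where Sigma = E[g g^T].

    Each update costs O(d^2): one matrix-vector product and one outer product.
    The step-size schedule and the safeguard are assumptions of this sketch.
    """
    d = grads.shape[1]
    eye = np.eye(d)
    a = np.eye(d)  # symmetric positive-definite starting point
    for t, g in enumerate(grads, start=1):
        v = a @ g                          # A_t g, O(d^2)
        step = c / (t + t0) ** alpha       # decreasing step size
        # Safeguard: step * g^T A_t g <= 1/2 keeps A_{t+1} positive definite.
        step = min(step, 0.5 / max(g @ v, 1e-12))
        # Robbins-Monro step on h(A) = A Sigma A - I, with g g^T in place of Sigma.
        a += step * (eye - np.outer(v, v))
    return a

# Simulated gradients with a known covariance (hypothetical test case).
rng = np.random.default_rng(0)
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
grads = rng.standard_normal((20000, 2)) @ np.linalg.cholesky(sigma).T

a_hat = robbins_monro_inv_sqrt(grads)
err = np.max(np.abs(a_hat - inv_sqrt(sigma)))
print(f"max entrywise error: {err:.4f}")
```

The safeguard only shortens the step when a single gradient is unusually large relative to the current estimate, so it becomes inactive as the step size decays. Averaging the iterates, as in the weighted averaged versions mentioned in this summary, would be a natural refinement.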

Paper Structure (22 sections, 14 theorems, 131 equations, 4 figures, 2 tables)

Introduction
Framework
A Full AdaGrad algorithm with $\mathcal{O}(td^{2})$ operations
Estimating $\Sigma^{-1/2}$ with the help of a Robbins-Monro algorithm
Full AdaGrad algorithms with $\mathcal{O}(td^{2})$ operations
A Streaming Full AdaGrad algorithm with $\mathcal{O} (N_{t}d)$ operations
Applications
Discussion about the hyper-parameters involved in the different algorithms
Linear regression on simulated data
AdaGrad vs. Full AdaGrad
Study of the Full AdaGrad streaming version
Logistic regression on real data
Proofs
Proof of Theorem \ref{theo::consistency}
Proof of Theorem \ref{theo::rate}
...and 7 more sections

Key Result

Theorem 3.1

Suppose Assumptions \ref{ass1}, \ref{ass::moment} and \ref{ass::hess} hold. Suppose also that $2\gamma +2\nu > 3$ and $\nu + \beta < 1$. Then $\theta_{t}$ and $\theta_{t,\tau}$, defined by \eqref{thetaT} and \eqref{thetaTAU}, converge almost surely to $\theta^{*}$.

Figures (4)

Figure 1: Linear regression case with $(N,d)=(500000,200)$. Mean squared error with respect to the sample size for the AdaGrad and Full AdaGrad algorithms and their weighted averaged versions. Two values of $\Sigma_{X}$ are considered: $\Sigma_X = I_d$ (on the left) and $\Sigma_{X} = R$ (on the right).
Figure 2: Linear regression case with $(N,d)=(500000,400)$. Mean squared error with respect to the sample size for the AdaGrad and Full AdaGrad algorithms and their weighted averaged versions. Two values of $\Sigma_{X}$ are considered: $\Sigma_X = I_d$ (on the left) and $\Sigma_{X} = R$ (on the right).
Figure 3: From left to right: boxplots of the estimation errors for $\Sigma^{-1/2}$, boxplots of the estimation errors for $\theta$, and boxplots of the running time. In each case, $\Sigma_{X} = I_d$, $(N,d)=(1000000,400)$, and three values of the streaming batch size are considered: $n=1,\sqrt{d},d$.
Figure 4: From left to right: boxplots of the estimation errors for $\Sigma^{-1/2}$, boxplots of the estimation errors for $\theta$, and boxplots of the running time. In each case, $\Sigma_{X} = R$, $(N,d)=(1000000,400)$, and three values of the streaming batch size are considered: $n=1,\sqrt{d},d$.

Theorems & Definitions (17)

Theorem 3.1
Theorem 3.2
Theorem 3.3
Theorem 4.1
Theorem 4.2
Theorem 4.3
Lemma 6.1
Lemma 6.2
Lemma 6.3
Lemma 6.4
...and 7 more
