A Full Adagrad algorithm with O(Nd) operations

Antoine Godichon-Baggioni, Wei Lu, Bruno Portier

TL;DR

This work tackles the computational bottleneck of full-matrix AdaGrad by introducing a Robbins-Monro online estimator for $\Sigma^{-1/2}$, enabling efficient, full-matrix preconditioning with $\mathcal{O}(td^{2})$ per update and a streaming variant achieving $\mathcal{O}(N_t d)$. The proposed WAFA scheme combines averaged iterates with a positive-definite, data-adapted preconditioner to obtain asymptotic efficiency, including a central limit theorem for the averaged estimator. Theoretical results establish strong consistency, convergence rates, and asymptotic normality under carefully stated moment, Lipschitz, and Hessian conditions, while extensive simulations and real-data experiments demonstrate that full-matrix preconditioning yields clear gains when gradient coordinates are correlated and that streaming variants can dramatically reduce computation with little loss in accuracy. Overall, the paper provides a theoretically grounded, scalable approach to full AdaGrad with practical impact for large-scale, high-dimensional stochastic optimization.

Abstract

A novel approach is given to overcome the computational challenges of the full-matrix Adaptive Gradient algorithm (Full AdaGrad) in stochastic optimization. By developing a recursive method that estimates the inverse of the square root of the covariance of the gradient, alongside a streaming variant for parameter updates, the study offers efficient and practical algorithms for large-scale applications. This innovative strategy significantly reduces the complexity and resource demands typically associated with full-matrix methods, enabling more effective optimization processes. Moreover, the convergence rates of the proposed estimators and their asymptotic efficiency are given. Their effectiveness is demonstrated through numerical studies.
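
To make the idea of a recursive estimator for $\Sigma^{-1/2}$ concrete, here is a minimal sketch of one plausible Robbins-Monro recursion. It is an assumption for illustration, not the recursion stated in the paper: it treats $\Sigma^{-1/2}$ as the symmetric positive-definite root of $A \Sigma A = I_d$ and updates $A$ from one gradient sample at a time. The names `update_inv_sqrt` and `estimate_inv_sqrt`, the step size `c * t ** (-alpha)`, and the toy gradient stream are all hypothetical, and any safeguard the paper uses to keep the estimate positive definite is omitted.

```python
import math


def update_inv_sqrt(A, g, step):
    """One Robbins-Monro step toward a root of A * Sigma * A = I.

    A is a symmetric d x d matrix (list of lists), g one gradient sample.
    Cost is O(d^2): one matrix-vector product plus one rank-one update.
    """
    d = len(g)
    u = [sum(A[i][k] * g[k] for k in range(d)) for i in range(d)]  # u = A g
    # A is symmetric, so A g g^T A = u u^T: no matrix-matrix product needed.
    return [[A[i][j] + step * ((1.0 if i == j else 0.0) - u[i] * u[j])
             for j in range(d)] for i in range(d)]


def estimate_inv_sqrt(gradients, c=0.1, alpha=0.75):
    """Run the recursion from A = I with step sizes c * t^(-alpha) (assumed)."""
    d = len(gradients[0])
    A = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for t, g in enumerate(gradients, start=1):
        A = update_inv_sqrt(A, g, c * t ** (-alpha))
    return A


def matmul(X, Y):
    """Plain O(d^3) product, used only to check the result below."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


# Toy deterministic stream: two vectors whose average outer product is
# exactly Sigma = [[2, 1], [1, 2]] (a stand-in for random gradients).
s3 = math.sqrt(3.0)
stream = [[s3, s3], [1.0, -1.0]] * 2500
Sigma = [[2.0, 1.0], [1.0, 2.0]]

A_hat = estimate_inv_sqrt(stream)
ASA = matmul(matmul(A_hat, Sigma), A_hat)
# Largest entry of A_hat * Sigma * A_hat - I; small when A_hat ~ Sigma^(-1/2).
residual = max(abs(ASA[i][j] - (1.0 if i == j else 0.0))
               for i in range(2) for j in range(2))
```

Each step touches every entry of `A` once, which is where a per-sample cost of order $d^2$ comes from in this sketch. The step constants are small on purpose: with larger steps this unguarded recursion can leave the positive-definite cone.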

Paper Structure (22 sections, 14 theorems, 131 equations, 4 figures, 2 tables)


Key Result

Theorem 3.1

Suppose Assumptions ass1, ass::moment and ass::hess hold. Suppose also that $2\gamma +2\nu > 3$ and $\nu + \beta < 1$. Then $\theta_{t}$ and $\theta_{t,\tau}$ defined by thetaT and thetaTAU converge almost surely to $\theta^{*}$.
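
The two exponent conditions in Theorem 3.1 are simple inequalities, so a candidate triple can be checked by direct arithmetic. The helper below is a hypothetical convenience, not something from the paper; it treats $\gamma$, $\nu$ and $\beta$ as plain real numbers and does not check any other constraint the paper may place on them.

```python
def satisfies_theorem_3_1_exponents(gamma, nu, beta):
    """Check 2*gamma + 2*nu > 3 and nu + beta < 1 (conditions of Theorem 3.1)."""
    return 2 * gamma + 2 * nu > 3 and nu + beta < 1


# Illustrative values only (assumptions, not taken from the paper).
ok = satisfies_theorem_3_1_exponents(0.9, 0.7, 0.2)         # 3.2 > 3 and 0.9 < 1
too_slow = satisfies_theorem_3_1_exponents(0.6, 0.7, 0.2)   # 2.6 > 3 fails
too_large = satisfies_theorem_3_1_exponents(0.9, 0.7, 0.4)  # 1.1 < 1 fails
```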

Figures (4)

  • Figure 1: Linear regression case with $(N,d)=(500000,200)$. Mean squared error with respect to the sample size for AdaGrad and Full AdaGrad algorithms with their weighted averaged versions. Two values of $\Sigma_{X}$ are considered: $\Sigma_X = I_d$ (on the left) and $\Sigma_{X} = R$ (on the right).
  • Figure 2: Linear regression case with $(N,d)=(500000,400)$. Mean squared error with respect to the sample size for AdaGrad and Full AdaGrad algorithms with their weighted averaged versions. Two values of $\Sigma_{X}$ are considered: $\Sigma_X = I_d$ (on the left) and $\Sigma_{X} = R$ (on the right).
  • Figure 3: From left to right: boxplots of the estimation errors for $\Sigma^{-1/2}$, boxplots of the estimation errors for $\theta$, and boxplots of running time. In each case, $\Sigma_{X} = I_d$, $(N,d)=(1000000,400)$, and three possible values of the streaming batch size are considered: $n=1,\sqrt{d},d$.
  • Figure 4: From left to right: boxplots of the estimation errors for $\Sigma^{-1/2}$, boxplots of the estimation errors for $\theta$, and boxplots of running time. In each case, $\Sigma_{X} = R$, $(N,d)=(1000000,400)$, and three possible values of the streaming batch size are considered: $n=1,\sqrt{d},d$.

Theorems & Definitions (17)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Lemma 6.1
  • Lemma 6.2
  • Lemma 6.3
  • Lemma 6.4
  • ...and 7 more