A Full Adagrad algorithm with O(Nd) operations
Antoine Godichon-Baggioni, Wei Lu, Bruno Portier
TL;DR
This work tackles the computational bottleneck of full-matrix AdaGrad by introducing a Robbins-Monro online estimator for $\Sigma^{-1/2}$, enabling efficient, full-matrix preconditioning with $\mathcal{O}(td^{2})$ per update and a streaming variant achieving $\mathcal{O}(N_t d)$. The proposed WAFA scheme combines averaged iterates with a positive-definite, data-adapted preconditioner to obtain asymptotic efficiency, including a central limit theorem for the averaged estimator. Theoretical results establish strong consistency, convergence rates, and asymptotic normality under carefully stated moment, Lipschitz, and Hessian conditions, while extensive simulations and real-data experiments demonstrate that full-matrix preconditioning yields clear gains when gradient coordinates are correlated and that streaming variants can dramatically reduce computation with little loss in accuracy. Overall, the paper provides a theoretically grounded, scalable approach to full AdaGrad with practical impact for large-scale, high-dimensional stochastic optimization.
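A rough operation count suggests how a streaming scheme can reach a cost that is linear in $d$. It rests on assumptions not stated in this summary: each stochastic gradient costs $\mathcal{O}(d)$ to evaluate, and one recursive update of the preconditioner plus one matrix-vector product costs $\mathcal{O}(d^{2})$. Here $n$ is a hypothetical block size and $N$ the total number of samples; the preconditioner is assumed to be refreshed once per block.

$$
\underbrace{\frac{N}{n}}_{\text{blocks}} \Big( \underbrace{n \, \mathcal{O}(d)}_{\text{gradients}} + \underbrace{\mathcal{O}(d^{2})}_{\text{preconditioner update and product}} \Big)
= \mathcal{O}\Big( Nd + \frac{N d^{2}}{n} \Big),
\qquad n \asymp d \;\Longrightarrow\; \mathcal{O}(Nd).
$$

Under these assumptions, taking blocks of size proportional to the dimension amortizes the quadratic cost of the full-matrix step over $d$ samples.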
Abstract
This work presents an approach to overcoming the computational cost of the full-matrix Adaptive Gradient algorithm (Full AdaGrad) in stochastic optimization. It develops a recursive method that estimates the inverse square root of the covariance of the gradient, together with a streaming variant for the parameter updates, yielding efficient and practical algorithms for large-scale applications. This strategy substantially reduces the complexity and resource demands typically associated with full-matrix methods. Convergence rates of the proposed estimators and their asymptotic efficiency are established, and their effectiveness is demonstrated through numerical studies.
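To make the idea of recursively estimating the inverse square root of the gradient covariance concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not necessarily the recursion used in the paper: it applies a Robbins-Monro step to the fixed-point equation $A \Sigma A = I_d$, whose symmetric positive-definite solution is $\Sigma^{-1/2}$. The function name `inv_sqrt_cov_online`, the step sequence $\gamma_k = c_\gamma k^{-\alpha}$, and the step-size safeguard are all choices made for this sketch.

```python
import numpy as np

def inv_sqrt_cov_online(grads, c_gamma=1.0, alpha=0.75):
    """Recursively estimate Sigma^{-1/2} from a stream of gradient vectors.

    Hypothetical helper for illustration: a Robbins-Monro step on
    A Sigma A = I, using one gradient sample per iteration.
    """
    grads = np.asarray(grads, dtype=float)
    n, d = grads.shape
    A = np.eye(d)                        # symmetric positive-definite start
    for k, g in enumerate(grads, start=1):
        v = A @ g                        # O(d^2) matrix-vector product
        q = float(g @ v)                 # g^T A g, positive while A is positive definite
        gamma = c_gamma * k ** (-alpha)  # decreasing step size
        # Safeguard (an assumption of this sketch): gamma * q <= 1 keeps
        # A - gamma * v v^T positive semi-definite, so adding gamma * I
        # leaves the new A positive definite.
        gamma = min(gamma, 1.0 / max(q, 1e-12))
        A = A - gamma * (np.outer(v, v) - np.eye(d))  # O(d^2) update
    return A

# Usage: synthetic zero-mean "gradients" with covariance diag(4, 1, 0.25),
# so the target Sigma^{-1/2} is diag(0.5, 1, 2).
rng = np.random.default_rng(0)
samples = rng.standard_normal((20000, 3)) * np.array([2.0, 1.0, 0.5])
A_hat = inv_sqrt_cov_online(samples)
```

Each iteration costs $\mathcal{O}(d^{2})$ because it needs only a matrix-vector product and an outer product, with no matrix inversion or eigendecomposition. The safeguard is there only to keep the iterate positive definite in early iterations, when the step size is still large.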
