Faster Adaptive Optimization via Expected Gradient Outer Product Reparameterization
Adela DePavia, Vasileios Charisopoulos, Rebecca Willett
TL;DR
The paper addresses the rotational sensitivity of adaptive optimizers by introducing an orthonormal reparameterization based on the expected gradient outer product (EGOP). It defines $\tilde{f}(\tilde{\theta}) = f(V\tilde{\theta})$, where $V$ comes from the eigendecomposition of the EGOP, and proves that, for objectives with strong EGOP spectral decay, adaptive methods like Adagrad reach first-order stationary points faster, with convergence governed by the stable rank ${\rm sr}_f$. The work provides theoretical bounds linking improved convergence to EGOP properties (e.g., dense leading eigenvectors via the parameter $\beta$) and substantiates these findings with extensive experiments on linear, nonconvex neural, and convex objectives, including real-data image classification tasks. It offers practical scalability strategies, such as focusing on leading eigenvectors, block reparameterization, and approximate bases, to make EGOP reparameterization feasible for large models. Overall, the approach highlights how natural data’s geometry can be harnessed to accelerate adaptive optimization in deep learning and related domains.
Abstract
Adaptive optimization algorithms -- such as Adagrad, Adam, and their variants -- have found widespread use in machine learning, signal processing, and many other settings. Several methods in this family are not rotationally equivariant, meaning that simple reparameterizations (i.e., changes of basis) can drastically affect their convergence. However, their sensitivity to the choice of parameterization has not been systematically studied; it is not clear how to identify a "favorable" change of basis in which these methods perform best. In this paper we propose a reparameterization method and demonstrate both theoretically and empirically its potential to improve the convergence behavior of adaptive algorithms. Our method is an orthonormal transformation based on the expected gradient outer product (EGOP) matrix, which can be approximated using either full-batch or stochastic gradient oracles. We show that for a broad class of functions, the sensitivity of adaptive algorithms to choice-of-basis is influenced by the decay of the EGOP matrix spectrum. We illustrate the potential impact of EGOP reparameterization by presenting empirical evidence and theoretical arguments that common machine learning tasks with "natural" data exhibit EGOP spectral decay.
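The reparameterization described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the EGOP is estimated by averaging gradient outer products at points drawn from an isotropic Gaussian (the sampling distribution and sample count are assumptions made here), and the helper names `estimate_egop`, `egop_basis`, and `reparameterize`, as well as the toy quadratic objective, are hypothetical and chosen only for illustration.

```python
import numpy as np

def estimate_egop(grad_fn, dim, num_samples=1000, scale=1.0, seed=0):
    """Monte Carlo estimate of E[grad f(theta) grad f(theta)^T].

    Assumption: theta is sampled from an isotropic Gaussian with
    standard deviation `scale`.
    """
    rng = np.random.default_rng(seed)
    egop = np.zeros((dim, dim))
    for _ in range(num_samples):
        theta = scale * rng.standard_normal(dim)
        g = grad_fn(theta)
        egop += np.outer(g, g)
    return egop / num_samples

def egop_basis(egop):
    """Eigendecomposition of the symmetric EGOP, eigenvalues in descending order."""
    eigvals, eigvecs = np.linalg.eigh(egop)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

def reparameterize(f, grad_fn, V):
    """Return f_tilde(t) = f(V t) and its gradient V^T grad f(V t)."""
    f_tilde = lambda t: f(V @ t)
    grad_tilde = lambda t: V.T @ grad_fn(V @ t)
    return f_tilde, grad_tilde

# Toy objective (illustrative only): a quadratic whose Hessian has a
# decaying spectrum expressed in a randomly rotated basis.
rng = np.random.default_rng(1)
dim = 5
U, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
A = U @ np.diag([10.0, 1.0, 0.1, 0.01, 0.001]) @ U.T
f = lambda th: 0.5 * th @ A @ th
grad_f = lambda th: A @ th

egop = estimate_egop(grad_f, dim)
eigvals, V = egop_basis(egop)
f_tilde, grad_tilde = reparameterize(f, grad_f, V)
```

An adaptive method such as Adagrad would then be run on `f_tilde` using `grad_tilde`, and an iterate `t` in the new coordinates maps back to the original parameters as `V @ t`. Because `V` is orthonormal, the change of basis leaves the objective values unchanged and only alters the coordinates in which the optimizer applies its per-coordinate scaling.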
