Faster Adaptive Optimization via Expected Gradient Outer Product Reparameterization

Adela DePavia, Vasileios Charisopoulos, Rebecca Willett

TL;DR

The paper addresses the rotational sensitivity of adaptive optimizers by introducing an orthonormal reparameterization based on the expected gradient outer product (EGOP). It defines $\tilde{f}(\tilde{\theta}) = f(V \tilde{\theta})$, where $V$ comes from the EGOP eigendecomposition, and proves that, for objectives with strong EGOP spectral decay, adaptive methods like Adagrad reach first-order stationary points faster, with convergence governed by the stable rank ${\rm sr}_f$. The work provides theoretical bounds linking improved convergence to EGOP properties (e.g., dense leading eigenvectors via the parameter $\beta$) and substantiates these findings with extensive experiments on linear, nonconvex neural, and convex objectives, including real-data image classification tasks. It offers practical scalability strategies, such as focusing on leading eigenvectors, block reparameterization, and approximate bases, to make EGOP reparameterization feasible for large models. Overall, the approach highlights how the geometry of natural data can be harnessed to accelerate adaptive optimization in deep learning and related domains.

Abstract

Adaptive optimization algorithms -- such as Adagrad, Adam, and their variants -- have found widespread use in machine learning, signal processing and many other settings. Several methods in this family are not rotationally equivariant, meaning that simple reparameterizations (i.e. change of basis) can drastically affect their convergence. However, their sensitivity to the choice of parameterization has not been systematically studied; it is not clear how to identify a "favorable" change of basis in which these methods perform best. In this paper we propose a reparameterization method and demonstrate both theoretically and empirically its potential to improve their convergence behavior. Our method is an orthonormal transformation based on the expected gradient outer product (EGOP) matrix, which can be approximated using either full-batch or stochastic gradient oracles. We show that for a broad class of functions, the sensitivity of adaptive algorithms to choice-of-basis is influenced by the decay of the EGOP matrix spectrum. We illustrate the potential impact of EGOP reparameterization by presenting empirical evidence and theoretical arguments that common machine learning tasks with "natural" data exhibit EGOP spectral decay.
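
To make the reparameterization concrete, here is a minimal NumPy sketch, not the authors' implementation. It estimates the EGOP as a Monte Carlo average of gradient outer products, takes its eigenvectors as the orthonormal matrix $V$, and runs a plain Adagrad loop on both $f$ and $\tilde{f}(\tilde{\theta}) = f(V\tilde{\theta})$. The toy quadratic objective, the standard normal sampling distribution, the sample count, and all function names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Hypothetical toy objective (not from the paper): a quadratic whose
# principal axes are rotated away from the coordinate axes.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag([10.0, 1.0, 0.1, 0.01, 0.001]) @ Q.T


def f(theta):
    return 0.5 * theta @ H @ theta


def grad_f(theta):
    return H @ theta


def estimate_egop(grad_fn, thetas):
    """Monte Carlo estimate of E[grad f(theta) grad f(theta)^T]."""
    grads = np.stack([grad_fn(t) for t in thetas])  # shape (n, d)
    return grads.T @ grads / len(grads)


def egop_eigenbasis(egop):
    """Eigenvalues (descending) and matching orthonormal eigenvectors."""
    eigvals, eigvecs = np.linalg.eigh(egop)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]


def adagrad(grad_fn, theta0, lr=0.5, steps=200, eps=1e-8):
    """Plain diagonal Adagrad with a fixed step count."""
    theta = theta0.astype(float).copy()
    accum = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        accum += g ** 2
        theta -= lr * g / (np.sqrt(accum) + eps)
    return theta


# Assumed sampling distribution: standard normal points in R^d.
thetas = rng.standard_normal((2000, d))
egop = estimate_egop(grad_f, thetas)
eigvals, V = egop_eigenbasis(egop)


def f_tilde(z):
    # Reparameterized objective: f_tilde(z) = f(V z).
    return f(V @ z)


def grad_f_tilde(z):
    # Chain rule: grad f_tilde(z) = V^T grad f(V z).
    return V.T @ grad_f(V @ z)


theta0 = rng.standard_normal(d)
theta_orig = adagrad(grad_f, theta0)
# Optimize in EGOP coordinates, then map the result back with V.
theta_rep = V @ adagrad(grad_f_tilde, V.T @ theta0)
print("original coordinates:", f(theta_orig))
print("EGOP coordinates:    ", f(theta_rep))
```

Because $V$ is orthonormal, the starting point $V^\top \theta_0$ in the new coordinates corresponds to the same point $\theta_0$ in the original ones, so the two runs differ only in the basis seen by the optimizer.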

Paper Structure

This paper contains 56 sections, 18 theorems, 188 equations, 13 figures, 1 algorithm.

Key Result

Theorem 3

Consider a function $f:\mathbb{R}^d\rightarrow \mathbb{R}$ and a sampling distribution $\rho$ satisfying Assumptions (assumption:sampling-distribution) and (assume:Lipschitz-Hessian). Let $\Delta_f \stackrel{\mathrm{def}}{=} f(\theta_0) - \inf_{\theta\in \mathbb{R}^d} f(\theta)$ for some initialization [...] where $\operatorname{sr}_f$ denotes the stable rank (eq:def-stable-rank).

Figures (13)

  • Figure 1: (Left) Visualization of optimizing a two-dimensional log-sum-exp objective (\ref{eq:log-sum-exp-objective}) using Adagrad in both original coordinates and under EGOP reparameterization. In the EGOP eigenbasis, the primary directions of function variation are axis-aligned. Experimental details in \ref{ssec:details-for-opener-cartoon}. (Right) Negative log-likelihood loss over epochs from training a 2-layer ReLU network in 2.4k dimensions to classify handwritten digits using Adam, Adagrad, SGD, and SGD with momentum, in both original coordinates and under reparameterization. Equivariant methods (e.g. SGD) exhibit no change under reparameterization. See discussion in Section \ref{sec:experimental-results}.
  • Figure 2: The EGOP eigenspectrum of a 2-layer ReLU network on the UCI handwritten digits dataset. Plot shows ratio $\lambda_k/\lambda_1$ as a function of eigenvalue index $k$, indexed in decreasing order.
  • Figure 3: Training multilayer linear networks (\ref{eq:linear-feedforward-objective}). Both SGD and SGD with momentum are equivariant optimization methods, so their results in original and reparameterized coordinates are exactly superimposed. In \ref{fig:linear-layers-global-reparam-valloss-vs-LR} we consider the minimum validation loss achieved over epochs during training. Results are aggregated over 10 independent trials, with traces showing medians and shading indicating the 25th-75th percentiles. Asterisks indicate the learning rate used for each method in \ref{fig:linear-layers-global-reparam-loss-vs-epochs}. Learning rates were chosen to minimize the validation loss of the algorithm in original coordinates.
  • Figure 4: Block EGOP reparameterization on Fashion-MNIST. Results are aggregated over independent trials corresponding to different random initializations. Medians are plotted as traces, and shaded regions indicate the 25th-75th percentiles. Each algorithm (Adagrad, Adam, etc.) uses the same learning rate for both coordinate systems. Full details in \ref{sec:experimental-details}.
  • Figure 5: Euclidean norm of the gradient at the $t$th iterate. Learning rates were chosen to minimize the loss of the algorithm in original coordinates. We induce EGOP spectral decay by choice of data matrix $A$ with singular values $\sigma_k(A) = k^{-\alpha}$. As noted in the prose, in some plots the dotted traces coincide with the solid traces and are thus not visible (Adagrad in \ref{fig:log-sum-exp}, Adam in \ref{fig:log-sum-exp}).
  • ...and 8 more figures
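
Two quantities from the captions above can be sketched in a few lines of NumPy: the normalized eigenvalue ratio $\lambda_k/\lambda_1$ plotted in Figure 2, and a data matrix $A$ with singular values $\sigma_k(A) = k^{-\alpha}$ as in Figure 5. This is an illustrative sketch under stated assumptions. The random orthonormal factors, the matrix sizes, the value of $\alpha$, and the helper names are hypothetical, and $A^\top A$ is used only as a stand-in symmetric positive semidefinite matrix, not as the EGOP of any objective in the paper.

```python
import numpy as np


def matrix_with_power_law_spectrum(n, d, alpha, rng):
    """Build an n x d matrix with singular values sigma_k = k^(-alpha).

    The left and right singular vectors are random orthonormal bases,
    which is an assumption made for this sketch.
    """
    r = min(n, d)
    sigma = np.arange(1, r + 1).astype(float) ** (-alpha)
    U, _ = np.linalg.qr(rng.standard_normal((n, r)))  # n x r, orthonormal columns
    W, _ = np.linalg.qr(rng.standard_normal((d, r)))  # d x r, orthonormal columns
    return U @ np.diag(sigma) @ W.T, sigma


def normalized_spectrum(sym_matrix):
    """Return lambda_k / lambda_1 with eigenvalues in decreasing order."""
    eigvals = np.linalg.eigvalsh(sym_matrix)[::-1]  # eigvalsh is ascending
    return eigvals / eigvals[0]


rng = np.random.default_rng(0)
A, sigma = matrix_with_power_law_spectrum(n=50, d=20, alpha=2.0, rng=rng)

# Check the construction: the singular values of A follow k^(-alpha).
svals = np.linalg.svd(A, compute_uv=False)

# Stand-in PSD matrix: the eigenvalues of A^T A are sigma_k^2.
ratios = normalized_spectrum(A.T @ A)
print(ratios[:5])
```

The same `normalized_spectrum` helper could be applied to any symmetric estimate of an EGOP matrix to inspect how quickly its eigenvalues decay.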

Theorems & Definitions (31)

  • Theorem 3: Informal
  • Theorem 4
  • Theorem 5
  • Lemma 6
  • Corollary 7
  • Lemma 8
  • Lemma 9
  • Proof
  • Lemma 10
  • Proof of Lemma \ref{lem:Hessian-coor-wise-smooth}
  • ...and 21 more