Understanding Adam Requires Better Rotation Dependent Assumptions

Tianyue H. Zhang, Lucas Maes, Alan Milligan, Alexia Jolicoeur-Martineau, Ioannis Mitliagkas, Damien Scieur, Simon Lacoste-Julien, Charles Guille-Escuret

TL;DR

The paper reveals that Adam’s empirical gains over SGD are not captured by rotation-invariant theories, showing that performance deteriorates when the parameter-space basis is randomly rotated. It demonstrates that structured, SVD-based rotations can improve optimization in transformers like GPT-2 and ViT, and that the orthogonality of layer updates correlates with these gains, paralleling ideas from the Muon optimizer. By systematically testing rotation scopes and critiquing common rotation-dependent assumptions (L-infinity bounds, Hessian block structure, and smoothness), the work identifies update orthogonality as a promising axis for rotation-aware theory. The findings suggest new directions for principled, basis-aware optimization methods and underline the need for rotation-sensitive analyses to explain and leverage Adam’s practical advantages.

Abstract

Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam's sensitivity to rotations of the parameter space. We observe that Adam's performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis in practice. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam's advantages theoretically. To better understand the rotation-dependent properties that benefit Adam, we also identify structured rotations that preserve or even enhance its empirical performance. We then examine the rotation-dependent assumptions in the literature and find that they fall short in explaining Adam's behaviour across various rotation types. In contrast, we verify the orthogonality of the update as a promising indicator of Adam's basis sensitivity, suggesting it may be the key quantity for developing rotation-dependent theoretical frameworks that better explain its empirical success.

Paper Structure

This paper contains 46 sections, 1 theorem, 25 equations, 34 figures, 2 tables, 6 algorithms.

Key Result

Proposition 1

Stochastic Gradient Descent with momentum is rotation-equivariant.
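
Proposition 1 can be checked numerically on a small problem. The sketch below is an illustration, not the paper's code: it assumes a 3-dimensional quadratic, a random rotation drawn via QR, and textbook update rules for SGD with momentum and Adam. The names `random_rotation`, `sgd_momentum`, `adam`, `sgd_gap`, and `adam_gap` are all hypothetical. It runs each optimizer on the original objective and on the rotated objective $f(\mathbf{R}^\top \mathbf{u})$ started from $\mathbf{R}\mathbf{w}_0$, then measures how far the rotated-space result is from the rotated original result.

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Random rotation matrix (orthogonal, determinant +1) from a QR factorization."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    q = q * np.sign(np.diag(r))      # fix column signs
    if np.linalg.det(q) < 0:         # turn a reflection into a rotation
        q[:, 0] = -q[:, 0]
    return q

def sgd_momentum(grad_fn, w0, lr=0.01, beta=0.9, steps=50):
    """Heavy-ball SGD with a deterministic gradient (assumed textbook form)."""
    w, m = w0.copy(), np.zeros_like(w0)
    for _ in range(steps):
        m = beta * m + grad_fn(w)
        w = w - lr * m
    return w

def adam(grad_fn, w0, lr=0.01, b1=0.9, b2=0.999, eps=1e-8, steps=50):
    """Standard Adam with bias correction (assumed textbook form)."""
    w = w0.copy()
    m, v = np.zeros_like(w0), np.zeros_like(w0)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g   # element-wise: depends on the basis
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy objective f(w) = 0.5 * w^T H w (an assumption made for illustration).
H = np.diag([1.0, 10.0, 100.0])
R = random_rotation(3)
w0 = np.array([1.0, 1.0, 1.0])

grad = lambda w: H @ w                      # gradient of f
grad_rot = lambda u: R @ (H @ (R.T @ u))    # gradient of u -> f(R^T u)

# Equivariance means: optimizing the rotated problem from R w0 yields R w_T.
sgd_gap = np.linalg.norm(R @ sgd_momentum(grad, w0) - sgd_momentum(grad_rot, R @ w0))
adam_gap = np.linalg.norm(R @ adam(grad, w0) - adam(grad_rot, R @ w0))

print(f"SGD-M gap: {sgd_gap:.2e}")
print(f"Adam  gap: {adam_gap:.2e}")
```

In this setup the SGD-M gap should sit at floating-point round-off level, because every operation in its update is linear in the gradient. The Adam gap is expected to be clearly nonzero, because the element-wise square and square root do not commute with the rotation.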

Figures (34)

  • Figure 1: Adam's performance degrades under certain random rotations of the parameter space, demonstrating its dependence on the standard basis. (a) For GPT-2, global rotations lead to a $16\%$ slowdown in training. (b) ViT experiences a more dramatic $96\%$ slowdown under global rotations. Performance is preserved under output-wise rotations but progressively worsens with input-wise, layer-wise, and global rotations, revealing Adam's increasing sensitivity to coordinate changes of broader scopes. Experimental details are provided in Section \ref{subsec:main_exp}.
  • Figure 2: Trajectories of SGD-M and Adam on a quadratic under two different rotations. SGD-M maintains the same trajectory up to rotation; Adam does not.
  • Figure 3: Methodology to train neural networks under parameter space rotations. (i) Forward and backward passes in the standard space to retrieve the gradients. (ii) The gradients are rotated using $\mathbf{R}$. (iii) Adam receives the rotated gradients and produces an update $\Delta \mathbf{w}^{(\mathbf{R})}$ in the rotated space. (iv) $\Delta \mathbf{w}^{(\mathbf{R})}$ is rotated back to the original space using $\mathbf{R}^\top$. (v) The parameters are updated with $\mathbf{R}^\top \Delta \mathbf{w}^{(\mathbf{R})}$.
  • Figure 4: Illustration of different rotation scopes for a model with weights $\mathcal{W} \overset{\Delta}{=} \{\mathbf{W}_1, \mathbf{W}_2, \mathbf{W}_3\}$. Global rotation rotates the entire parameter space at once, layer-wise rotation only rotates within each layer's subspace, and input-wise (resp. output-wise) rotation rotates within the weights originating from the same input neuron (resp. leading to the same output neuron).
  • Figure 5: Performance of GPT-2 trained with Adam in SVD-rotated space, without rotations, with random output-wise rotation, and with random global rotation. The rotations computed with SVD lead to a sizeable improvement.
  • ...and 29 more figures
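
The five steps in the Figure 3 caption can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the least-squares objective, the class `AdamState`, the function `train_rotated`, and the dense rotation matrix are all hypothetical choices made here. A dense $\mathbf{R}$ is only practical for very small parameter vectors like this one.

```python
import numpy as np

class AdamState:
    """Textbook Adam on a flat vector; step() returns the update, not new weights."""
    def __init__(self, dim, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m = np.zeros(dim)
        self.v = np.zeros(dim)
        self.t = 0

    def step(self, grad):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        return -self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

def train_rotated(w0, grad_fn, R, steps=300, lr=1e-2):
    """Run Adam in the basis defined by R while keeping weights in the standard basis."""
    w = w0.copy()
    opt = AdamState(w.size, lr=lr)
    for _ in range(steps):
        g = grad_fn(w)               # (i)   gradient in the standard space
        g_rot = R @ g                # (ii)  rotate the gradient
        delta_rot = opt.step(g_rot)  # (iii) Adam update in the rotated space
        delta = R.T @ delta_rot      # (iv)  rotate the update back
        w = w + delta                # (v)   apply it to the parameters
    return w

# Toy realizable least-squares problem (an assumption made for illustration).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = A @ np.ones(5)
loss = lambda w: 0.5 * np.mean((A @ w - b) ** 2)
grad_fn = lambda w: A.T @ (A @ w - b) / len(b)

Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # random orthogonal matrix
w0 = np.zeros(5)

w_plain = train_rotated(w0, grad_fn, np.eye(5))   # R = I recovers plain Adam
w_rot = train_rotated(w0, grad_fn, Q)             # Adam in a randomly rotated basis

print("initial loss:", loss(w0))
print("plain Adam  :", loss(w_plain))
print("rotated Adam:", loss(w_rot))
```

Because the forward and backward passes stay in the standard space, the model itself is unchanged; only the basis in which Adam accumulates its moment estimates differs. Restricting $\mathbf{R}$ to a block-diagonal form would give narrower scopes in the spirit of Figure 4, though the exact construction used in the paper is not shown here.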

Theorems & Definitions (4)

  • Definition 1: Rotational equivariance
  • Proposition 1
  • Definition 2
  • Proof: Proof of Proposition 1