Understanding Adam Requires Better Rotation Dependent Assumptions
Tianyue H. Zhang, Lucas Maes, Alan Milligan, Alexia Jolicoeur-Martineau, Ioannis Mitliagkas, Damien Scieur, Simon Lacoste-Julien, Charles Guille-Escuret
TL;DR
The paper shows that Adam's performance deteriorates when the parameter-space basis is randomly rotated, so rotation-invariant theories cannot account for its empirical gains over SGD. It demonstrates that structured, SVD-based rotations can improve optimization in transformers such as GPT-2 and ViT, and that the orthogonality of layer updates correlates with these gains, paralleling ideas from the Muon optimizer. By systematically testing rotation scopes and critiquing common rotation-dependent assumptions (L-infinity bounds, Hessian block structure, and smoothness), the work identifies update orthogonality as a promising basis for rotation-aware theory. The findings point toward principled, basis-aware optimization methods and underline the need for rotation-sensitive analyses to explain and exploit Adam's practical advantages.
Abstract
Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam's sensitivity to rotations of the parameter space. We observe that Adam's performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis in practice. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam's advantages theoretically. To better understand the rotation-dependent properties that benefit Adam, we also identify structured rotations that preserve or even enhance its empirical performance. We then examine the rotation-dependent assumptions in the literature and find that they fall short in explaining Adam's behaviour across various rotation types. In contrast, we verify the orthogonality of the update as a promising indicator of Adam's basis sensitivity, suggesting it may be the key quantity for developing rotation-dependent theoretical frameworks that better explain its empirical success.
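To make the notion of basis sensitivity concrete, the following is a minimal sketch, not the paper's experimental setup. It assumes a toy ill-conditioned quadratic loss with a diagonal Hessian, a random orthogonal matrix `Q` drawn via QR decomposition, and plain NumPy implementations of gradient descent and Adam; all names, dimensions, and hyperparameters are illustrative choices. Each optimizer is run once in the original basis and once in the rotated basis, and the rotated result is mapped back for comparison.

```python
import numpy as np

def run_gd(grad, w0, lr, steps):
    """Plain gradient descent."""
    w = w0.copy()
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def run_adam(grad, w0, lr, steps, b1=0.9, b2=0.999, eps=1e-8):
    """Standard Adam with bias correction (elementwise second-moment scaling)."""
    w = w0.copy()
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        m_hat = m / (1 - b1**t)
        v_hat = v / (1 - b2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

rng = np.random.default_rng(0)
d = 10
h = np.logspace(0, 3, d)  # assumed diagonal Hessian, condition number 1e3

def grad(w):
    # Gradient of f(w) = 0.5 * sum(h * w**2) in the original basis.
    return h * w

# Random orthogonal matrix: the Q factor of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def grad_rot(v):
    # Gradient of the same loss expressed in rotated coordinates v = Q w.
    return Q @ grad(Q.T @ v)

w0 = rng.standard_normal(d)

# Run each optimizer in both bases; map the rotated result back with Q^T.
gd_plain = run_gd(grad, w0, lr=1e-3, steps=100)
gd_rot = Q.T @ run_gd(grad_rot, Q @ w0, lr=1e-3, steps=100)
adam_plain = run_adam(grad, w0, lr=1e-2, steps=100)
adam_rot = Q.T @ run_adam(grad_rot, Q @ w0, lr=1e-2, steps=100)

gd_gap = np.linalg.norm(gd_plain - gd_rot)
adam_gap = np.linalg.norm(adam_plain - adam_rot)
print(f"GD gap between bases:   {gd_gap:.2e}")
print(f"Adam gap between bases: {adam_gap:.2e}")
```

In this sketch, gradient descent reaches the same point in both bases up to floating-point error, because its update commutes with orthogonal transformations. Adam's elementwise normalization does not, so its two trajectories diverge. This toy example only illustrates why a rotation-invariant analysis cannot distinguish the two Adam runs; it says nothing about which basis performs better in transformer training.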
