Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking
Yongzhong Xu
TL;DR
The paper shows that grokking in small transformer models trained on modular arithmetic unfolds within a low-dimensional, empirically invariant execution subspace, while loss-landscape curvature accumulates in directions orthogonal to this subspace. By introducing the commutator defect, the author quantifies transverse curvature and demonstrates that curvature growth precedes generalization, with the lead time following a power law in the grokking time. Causal interventions show that motion along the learned subspace is necessary for grokking, whereas artificially inducing curvature is not sufficient, supporting a metastable-escape geometric picture. Across fast and slow regimes and multiple seeds, the results support a dynamical-phase view of grokking and connect optimization geometry to interpretability, regularization, and potential diagnostics for training dynamics.
Abstract
Grokking -- the delayed transition from memorization to generalization in small algorithmic tasks -- remains poorly understood. We present a geometric analysis of optimization dynamics in transformers trained on modular arithmetic. PCA of attention weight trajectories reveals that training evolves predominantly within a low-dimensional execution subspace, with a single principal component capturing 68-83% of trajectory variance. To probe loss-landscape geometry, we measure commutator defects -- the non-commutativity of successive gradient steps -- and project them onto this learned subspace. We find that curvature grows sharply in directions orthogonal to the execution subspace while the trajectory remains largely confined to it. Importantly, curvature growth consistently precedes generalization across learning rates and hyperparameter regimes, with the lead time obeying a power law in the grokking timescale. Causal intervention experiments show that motion along the learned subspace is necessary for grokking, while artificially increasing curvature is insufficient. Together, these results support a geometric account in which grokking reflects escape from a metastable regime characterized by low-dimensional confinement and transverse curvature accumulation. All findings replicate across the learning-rate range studied, a qualitatively different slow regime (lr=5e-5, wd=0.1, 3 layers), and three random seeds, though alignment dynamics differ quantitatively between regimes. Further causal intervention experiments establish that orthogonal gradient flow is necessary but not sufficient for grokking: suppressing it prevents generalization with a monotonic dose-response across four operations, while artificially boosting curvature defects has no effect.
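The two measurements named in the abstract can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes one plausible reading of the commutator defect (the difference between the end points of taking gradient steps on batch A then B versus B then A, with no normalization), uses plain NumPy with two quadratic losses standing in for mini-batch losses, and replaces attention-weight checkpoints with a synthetic trajectory. All function names (`trajectory_pca`, `commutator_defect`, `split_defect`) are hypothetical.

```python
import numpy as np

def trajectory_pca(checkpoints, k=1):
    """PCA of a (T, D) array of flattened weight checkpoints.

    Returns an orthonormal basis (D, k) for the top-k principal directions
    and the fraction of trajectory variance each one explains.
    """
    centered = checkpoints - checkpoints.mean(axis=0, keepdims=True)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    ratio = s**2 / np.sum(s**2)
    return vt[:k].T, ratio[:k]

def commutator_defect(grad_a, grad_b, theta, lr):
    """End-point difference between step order A->B and step order B->A.

    grad_a and grad_b map parameters to gradients (assumed stand-ins for
    gradients on two different mini-batches).
    """
    theta_ab = theta - lr * grad_a(theta)
    theta_ab = theta_ab - lr * grad_b(theta_ab)
    theta_ba = theta - lr * grad_b(theta)
    theta_ba = theta_ba - lr * grad_a(theta_ba)
    return theta_ab - theta_ba

def split_defect(defect, basis):
    """Split a defect vector into components inside and orthogonal to span(basis)."""
    parallel = basis @ (basis.T @ defect)
    return parallel, defect - parallel

# Toy setup: quadratic losses 0.5 * theta^T A theta and 0.5 * theta^T B theta.
rng = np.random.default_rng(0)
D = 6
A = rng.normal(size=(D, D)); A = A @ A.T
B = rng.normal(size=(D, D)); B = B @ B.T
theta = rng.normal(size=D)
lr = 1e-2

defect = commutator_defect(lambda t: A @ t, lambda t: B @ t, theta, lr)

# Synthetic trajectory: motion along one direction plus small isotropic noise.
direction = rng.normal(size=D)
steps = np.linspace(0.0, 1.0, 20)
checkpoints = np.outer(steps, direction) + 0.01 * rng.normal(size=(20, D))

basis, ratio = trajectory_pca(checkpoints, k=1)
parallel, orthogonal = split_defect(defect, basis)

# Fraction of the defect norm lying outside the learned subspace.
orthogonal_fraction = np.linalg.norm(orthogonal) / np.linalg.norm(defect)
```

For quadratic losses the defect has the closed form `lr**2 * (B @ A - A @ B) @ theta`, so it vanishes exactly when the two Hessians commute; this is the sense in which non-commuting gradient steps probe curvature. In a real experiment the gradient callables would come from the model's autodiff on two mini-batches, and the basis from PCA of saved attention-weight checkpoints.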
