Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking

Yongzhong Xu

TL;DR

The paper argues that grokking in small transformers trained on modular arithmetic unfolds within a low-dimensional, empirically invariant execution subspace, while loss-landscape curvature accumulates in directions orthogonal to that subspace. The author introduces the commutator defect to quantify this transverse curvature and shows that curvature growth precedes generalization, with the lead time following a power law in the grokking time. Causal interventions indicate that motion along the learned subspace is necessary for grokking, whereas artificially inducing curvature is not sufficient, which supports a metastable-escape geometric picture. The results hold across fast and slow regimes and multiple seeds, and they connect optimization geometry to interpretability, regularization, and potential diagnostics for training dynamics.

Abstract

Grokking -- the delayed transition from memorization to generalization in small algorithmic tasks -- remains poorly understood. We present a geometric analysis of optimization dynamics in transformers trained on modular arithmetic. PCA of attention weight trajectories reveals that training evolves predominantly within a low-dimensional execution subspace, with a single principal component capturing 68-83% of trajectory variance. To probe loss-landscape geometry, we measure commutator defects -- the non-commutativity of successive gradient steps -- and project them onto this learned subspace. We find that curvature grows sharply in directions orthogonal to the execution subspace while the trajectory remains largely confined to it. Importantly, curvature growth consistently precedes generalization across learning rates and hyperparameter regimes, with the lead time obeying a power law in the grokking timescale. Causal intervention experiments show that motion along the learned subspace is necessary for grokking, while artificially increasing curvature is insufficient. Together, these results support a geometric account in which grokking reflects escape from a metastable regime characterized by low-dimensional confinement and transverse curvature accumulation. All findings replicate across the tested learning-rate range, a qualitatively different slow regime (lr=5e-5, wd=0.1, 3 layers), and three random seeds, though alignment dynamics differ quantitatively between regimes. Causal intervention experiments establish that orthogonal gradient flow is necessary but not sufficient for grokking: suppressing it prevents generalization with a monotonic dose-response across four operations, while artificially boosting curvature defects has no effect.
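
The commutator defect described in the abstract can be illustrated with a minimal sketch. It assumes one plausible reading of "non-commutativity of successive gradient steps": take a plain gradient step on batch A and then on batch B, take the same two steps in the opposite order, and compare the resulting parameters. The least-squares loss, the function names (`grad_least_squares`, `commutator_defect`, `split_by_subspace`), and the random 2-D subspace are all hypothetical stand-ins; the paper's own definition, normalization, and optimizer may differ.

```python
import numpy as np

def grad_least_squares(theta, X, y):
    """Gradient of 0.5 * mean((X @ theta - y)**2) with respect to theta."""
    return X.T @ (X @ theta - y) / len(y)

def commutator_defect(theta, batch_a, batch_b, lr, grad_fn=grad_least_squares):
    """Parameter difference between stepping A-then-B and B-then-A.

    Assumption: plain gradient descent steps with a shared learning rate.
    """
    def step(t, batch):
        X, y = batch
        return t - lr * grad_fn(t, X, y)

    theta_ab = step(step(theta, batch_a), batch_b)
    theta_ba = step(step(theta, batch_b), batch_a)
    return theta_ab - theta_ba

def split_by_subspace(vec, basis):
    """Fractions of the squared norm of vec inside and orthogonal to a subspace.

    basis: (d, k) array with orthonormal columns spanning the subspace.
    """
    total = float(vec @ vec)
    inside = float(np.sum((basis.T @ vec) ** 2))
    return inside / total, 1.0 - inside / total

# Toy setup (hypothetical): 6 parameters, two random mini-batches.
rng = np.random.default_rng(0)
d = 6
theta0 = rng.normal(size=d)
batch_a = (rng.normal(size=(8, d)), rng.normal(size=8))
batch_b = (rng.normal(size=(8, d)), rng.normal(size=8))
lr = 0.1

defect = commutator_defect(theta0, batch_a, batch_b, lr)

# Stand-in for a learned subspace: an orthonormal basis of a random 2-D plane.
basis, _ = np.linalg.qr(rng.normal(size=(d, 2)))
frac_in, frac_out = split_by_subspace(defect, basis)
print(f"in-subspace fraction: {frac_in:.3f}, orthogonal fraction: {frac_out:.3f}")
```

For a quadratic loss this defect equals lr^2 * (H_B g_A - H_A g_B) exactly, where H and g are the per-batch Hessians and gradients at the starting point, which is why it can serve as a curvature probe. In the paper's setting the basis would come from PCA of the weight trajectory rather than from a random plane.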

Paper Structure (50 sections, 7 equations, 22 figures, 2 tables)

Figures (22)

  • Figure 1: Weight trajectories during grokking are rank-1. (a) PC1% across operations: grokking runs (wd=1.0) show 68--83% variance in a single component. (b) Eigenspectrum showing dominant first eigenvalue.
  • Figure 2: PCA concentration is genuine and increases over training. (a) Z-scores above random-walk null. (b) Expanding-window PC1% over training.
  • Figure 3: The execution manifold exhibits empirical invariance under the optimization dynamics. Commutator defect vectors are predominantly orthogonal to the PCA subspace, with curvature confined to the normal bundle.
  • Figure 4: Random subspace control confirms that the PCA projection is geometrically structured, not a dimensionality artifact. Exec/random ratio $\approx 1.8$--$2.9\times$ across operations.
  • Figure 5: Curvature explodes during grokking but remains orthogonal to the learned subspace.
  • ...and 17 more figures
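
The PC1% measurements in Figures 1 and 2 can be sketched as follows. This is an illustration on synthetic data, not the paper's pipeline: `pc1_fraction` and `random_walk_null` are hypothetical helper names, the "trajectory" is a made-up drift along one direction plus small noise, and the null model is assumed to be an isotropic Gaussian random walk of the same shape. A null of this kind matters because random walks already concentrate much of their variance in the first few principal components, so a raw PC1% figure is hard to interpret on its own.

```python
import numpy as np

def pc1_fraction(trajectory):
    """Fraction of trajectory variance captured by the first principal component.

    trajectory: (T, d) array, one flattened weight snapshot per row.
    """
    centered = trajectory - trajectory.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return float(variances[0] / variances.sum())

def random_walk_null(n_steps, dim, n_trials, rng):
    """PC1 fractions for isotropic Gaussian random walks of the same shape."""
    fractions = []
    for _ in range(n_trials):
        walk = np.cumsum(rng.normal(size=(n_steps, dim)), axis=0)
        fractions.append(pc1_fraction(walk))
    return np.array(fractions)

# Synthetic stand-in for a weight trajectory (hypothetical sizes and noise level).
rng = np.random.default_rng(0)
n_steps, dim = 50, 200
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
drift = np.linspace(0.0, 5.0, n_steps)[:, None] * direction[None, :]
synthetic = drift + 0.02 * rng.normal(size=(n_steps, dim))

observed = pc1_fraction(synthetic)
null = random_walk_null(n_steps, dim, n_trials=200, rng=rng)
z_score = (observed - null.mean()) / null.std()
print(f"PC1 fraction: {observed:.3f}, null mean: {null.mean():.3f}, z: {z_score:.1f}")
```

An expanding-window version, as in Figure 2(b), would call `pc1_fraction(trajectory[:t])` for increasing `t`.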