Early-Warning Signals of Grokking via Loss-Landscape Geometry

Yongzhong Xu

TL;DR

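The commutator defect, a curvature measure derived from non-commuting gradient updates, is a robust, architecture-agnostic, causally implicated early-warning signal for delayed generalization in transformers. Suppressing orthogonal gradient flow delays or prevents grokking in every task family studied, which establishes necessity as a universal finding.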

Abstract

Grokking -- the abrupt transition from memorization to generalization after prolonged training -- has been linked to confinement on low-dimensional execution manifolds in modular arithmetic. Whether this mechanism extends beyond arithmetic remains open. We study two sequence-learning benchmarks: SCAN compositional generalization and Dyck-1 depth prediction. Across both tasks and a wide range of learning rates, the commutator defect -- a curvature measure derived from non-commuting gradient updates -- rises well before generalization, with lead times following a superlinear power law (alpha approximately 1.18 for SCAN, approximately 1.13 for Dyck), consistent with prior results on modular arithmetic. Weight-space PCA reveals that spectral concentration is not a universal precursor; the commutator defect is. Causal interventions demonstrate a mechanistic role: amplifying non-commutativity accelerates grokking (roughly 32% on SCAN, roughly 50% on Dyck), while suppressing orthogonal gradient flow delays or prevents it. The three task families form a spectrum of causal sensitivity -- modular arithmetic is rigid, Dyck is responsive, SCAN is intermediate -- yet suppression delays or prevents grokking in all cases, establishing necessity as a universal finding. These results identify the commutator defect as a robust, architecture-agnostic, causally implicated early-warning signal for delayed generalization in transformers.
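
The abstract describes the commutator defect only as a curvature measure derived from non-commuting gradient updates. As a rough illustration of that idea, the sketch below compares the parameters reached by taking a gradient step on batch A and then batch B against the reverse order, and normalizes the gap by the two step sizes. The function name `commutator_defect`, the normalization, and the toy quadratic losses are assumptions made for this example; the paper's Definition 2 may differ in detail.

```python
import numpy as np

def commutator_defect(theta, grad_a, grad_b, lr, eps=1e-12):
    """Normalized gap between stepping A-then-B and B-then-A (illustrative only)."""
    step_a = lr * grad_a(theta)
    step_b = lr * grad_b(theta)
    # Apply the two gradient updates in both orders from the same start point.
    theta_ab = theta - step_a - lr * grad_b(theta - step_a)
    theta_ba = theta - step_b - lr * grad_a(theta - step_b)
    gap = np.linalg.norm(theta_ab - theta_ba)
    # Assumed normalization: product of the two first-step norms.
    return gap / (np.linalg.norm(step_a) * np.linalg.norm(step_b) + eps)

# Toy losses L(theta) = 0.5 * theta^T H theta, whose gradient is H @ theta.
H_a = np.array([[2.0, 0.0], [0.0, 1.0]])
H_b_commuting = np.array([[1.0, 0.0], [0.0, 3.0]])      # commutes with H_a
H_b_noncommuting = np.array([[1.0, 0.5], [0.5, 3.0]])   # does not commute with H_a
theta0 = np.array([1.0, -1.0])

flat = commutator_defect(
    theta0, lambda t: H_a @ t, lambda t: H_b_commuting @ t, lr=0.1
)
curved = commutator_defect(
    theta0, lambda t: H_a @ t, lambda t: H_b_noncommuting @ t, lr=0.1
)
print(f"commuting Hessians:     defect = {flat:.3e}")
print(f"non-commuting Hessians: defect = {curved:.3e}")
```

For quadratic losses the gap reduces to lr^2 times the matrix commutator of the two Hessians applied to theta, so the defect vanishes (up to rounding) when the Hessians commute and is positive otherwise. In a real training run, `grad_a` and `grad_b` would be minibatch gradients of the network loss.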

Paper Structure (65 sections, 9 equations, 13 figures, 7 tables)

Figures (13)

  • Figure 1: SCAN: Defect predicts grokking across learning rates. Commutator defect (solid lines, left axis, log scale) and test sequence accuracy (dashed lines, right axis) for seed 42 at five learning rates. Dotted vertical lines mark defect onset; dash-dot lines mark grokking step. At every learning rate, the defect spike precedes the accuracy transition. Compare with Figure 10 in xu2026integrability for the analogous modular arithmetic result.
  • Figure 2: Dyck: Defect predicts grokking across learning rates. Same format as Figure 1 but for the Dyck depth prediction task at six learning rates ($3{\times}10^{-5}$ through $10^{-2}$). The defect onset precedes grokking in all cases. At $\eta = 10^{-2}$, grokking takes longer than at $\eta = 10^{-3}$; this non-monotonicity is driven by accuracy oscillation.
  • Figure 3: SCAN: Lead time scaling law. (A) Lead time vs. learning rate. (B) Lead fraction vs. learning rate. (C) Defect onset step vs. grok step with power-law fit ($\alpha \approx 1.18$, $R^2 = 0.990$, $n = 11$). Colored points show individual seeds; black diamonds show LR means. Compare with Figure 11 in xu2026integrability ($\alpha = 1.27$, $R^2 = 0.97$, $n = 43$).
  • Figure 4: Dyck: Lead time scaling law. Same format as Figure 3 but for Dyck ($\alpha \approx 1.13$, $R^2 = 0.908$, $n = 14$). Fifteen runs across 6 learning rates and 3 seeds show positive lead time. All three task families (the two studied here plus modular arithmetic) exhibit super-linear scaling ($\alpha > 1$).
  • Figure 5: PC1 trajectory dissociation. (a) On SCAN, PC1 variance continues increasing through the grokking step (red dashed line), with no turnover before grokking. (b) On Dyck, PC1 turnover (green dotted line) occurs well before grokking, with a lead of 2,300 steps. The SCAN behavior contrasts with modular arithmetic (xu2026integrability), where PC1 turnover precedes grokking, as it does on Dyck. The commutator defect precedes grokking on all three tasks, making it a more universal signal.
  • ...and 8 more figures
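
The power-law fits reported for Figures 3 and 4 relate the defect onset step to the grokking step. A minimal sketch of such a fit follows, assuming the form t_grok ≈ c · t_onset^α and ordinary least squares in log-log space; the paper may use a different fitting procedure or orientation. The function name `fit_power_law` is hypothetical, and the data are synthetic values generated for this example, not results from the paper.

```python
import numpy as np

def fit_power_law(onset_steps, grok_steps):
    """Fit grok ≈ c * onset**alpha by linear regression on log-transformed data."""
    x = np.log(np.asarray(onset_steps, dtype=float))
    y = np.log(np.asarray(grok_steps, dtype=float))
    # polyfit with degree 1 returns [slope, intercept]; the slope is alpha.
    alpha, log_c = np.polyfit(x, y, 1)
    residual = y - (alpha * x + log_c)
    # R^2 measured in log space, where the fit was performed.
    r2 = 1.0 - residual.var() / y.var()
    return alpha, np.exp(log_c), r2

# Synthetic onset/grok pairs built from a known exponent (NOT paper data).
onset = np.array([200.0, 500.0, 1000.0, 2000.0, 5000.0])
grok = 3.0 * onset ** 1.2

alpha, c, r2 = fit_power_law(onset, grok)
lead_time = grok - onset  # steps between defect onset and grokking
print(f"alpha = {alpha:.3f}, c = {c:.3f}, R^2 = {r2:.4f}")
```

Under this assumed form, an exponent α > 1 means the grokking step grows faster than the onset step, so the lead time widens as training slows down (for example at smaller learning rates).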

Theorems & Definitions (4)

  • Definition 1: Execution manifold
  • Definition 2: Commutator defect
  • Definition 3: Invariance measure
  • Definition 4: Transverse decoupling