Early-Warning Signals of Grokking via Loss-Landscape Geometry

Yongzhong Xu

TL;DR

The commutator defect is a robust, architecture-agnostic, causally implicated early-warning signal for delayed generalization in transformers. Suppression experiments delay or prevent grokking on every task family studied, which indicates that the underlying mechanism is necessary for grokking.

Abstract

Grokking -- the abrupt transition from memorization to generalization after prolonged training -- has been linked to confinement on low-dimensional execution manifolds in modular arithmetic. Whether this mechanism extends beyond arithmetic remains open. We study two sequence-learning benchmarks: SCAN compositional generalization and Dyck-1 depth prediction. Across both tasks and a wide range of learning rates, the commutator defect -- a curvature measure derived from non-commuting gradient updates -- rises well before generalization, with lead times following a superlinear power law ($\alpha \approx 1.18$ for SCAN, $\alpha \approx 1.13$ for Dyck), consistent with prior results on modular arithmetic. Weight-space PCA reveals that spectral concentration is not a universal precursor; the commutator defect is. Causal interventions demonstrate a mechanistic role: amplifying non-commutativity accelerates grokking (roughly 32% on SCAN, roughly 50% on Dyck), while suppressing orthogonal gradient flow delays or prevents it. The three task families form a spectrum of causal sensitivity -- modular arithmetic is rigid, Dyck is responsive, SCAN is intermediate -- yet suppression delays or prevents grokking in all cases, establishing necessity as a universal finding. These results identify the commutator defect as a robust, architecture-agnostic, causally implicated early-warning signal for delayed generalization in transformers.

Paper Structure (65 sections, 9 equations, 13 figures, 7 tables)

Introduction
Key contributions.
Paper outline.
Background: Geometric Framework for Grokking
Grokking
The Execution Manifold
The Commutator Defect
Geometric interpretation.
Manifold Projection and Invariance
Random Subspace Control
Scaling Law
Experimental Setup
Tasks and Datasets
SCAN Compositional Generalization.
Dyck-1 Depth Prediction.
...and 50 more sections

Figures (13)

Figure 1: SCAN: Defect predicts grokking across learning rates. Commutator defect (solid lines, left axis, log scale) and test sequence accuracy (dashed lines, right axis) for seed 42 at five learning rates. Dotted vertical lines mark defect onset; dash-dot lines mark grokking step. At every learning rate, the defect spike precedes the accuracy transition. Compare with Figure 10 in xu2026integrability for the analogous modular arithmetic result.
Figure 2: Dyck: Defect predicts grokking across learning rates. Same format as Figure 1 but for the Dyck depth prediction task at six learning rates ($3{\times}10^{-5}$ through $10^{-2}$). The defect onset precedes grokking in all cases. At $\eta = 10^{-2}$, the non-monotonicity (longer grokking time than $\eta = 10^{-3}$) is visible, driven by accuracy oscillation.
Figure 3: SCAN: Lead time scaling law. (A) Lead time vs. learning rate. (B) Lead fraction vs. learning rate. (C) Defect onset step vs. grok step with power-law fit ($\alpha \approx 1.18$, $R^2 = 0.990$, $n = 11$). Colored points show individual seeds; black diamonds show LR means. Compare with Figure 11 in xu2026integrability ($\alpha = 1.27$, $R^2 = 0.97$, $n = 43$).
Figure 4: Dyck: Lead time scaling law. Same format as Figure 3 but for Dyck ($\alpha \approx 1.13$, $R^2 = 0.908$, $n = 14$). Fifteen runs across 6 learning rates and 3 seeds show positive lead time. All three task families (this work + modular arithmetic) exhibit super-linear scaling ($\alpha > 1$).
Figure 5: PC1 trajectory dissociation. (a) On SCAN, PC1 variance continues increasing through the grokking step (red dashed line), with no turnover before grokking. (b) On Dyck, PC1 turnover (green dotted line) occurs well before grokking, with a lead of 2,300 steps. The SCAN behavior contrasts with modular arithmetic xu2026integrability, where PC1 turnover precedes grokking, as it does on Dyck. The commutator defect precedes grokking on all three tasks, making it a more universal signal.
...and 8 more figures
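
The power-law fits quoted in Figures 3 and 4 can be illustrated with a minimal sketch. It assumes the fit is an ordinary least-squares line in log-log space with grok step modeled as a function of defect onset step; the source does not state the fitting procedure or which variable is the regressor, so both are assumptions. The function name `fit_power_law` and all data points are hypothetical and synthetic, not results from the paper.

```python
import math

def fit_power_law(x, y):
    """Least-squares fit of y = c * x**alpha in log-log space.

    Returns (alpha, c, r_squared), where r_squared is computed on the
    log-transformed values.
    """
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    alpha = sxy / sxx                      # slope = power-law exponent
    log_c = mean_y - alpha * mean_x        # intercept = log of prefactor
    ss_res = sum((b - (log_c + alpha * a)) ** 2 for a, b in zip(lx, ly))
    ss_tot = sum((b - mean_y) ** 2 for b in ly)
    return alpha, math.exp(log_c), 1.0 - ss_res / ss_tot

# Synthetic, noise-free data generated from y = 2 * x**1.2 (NOT paper data).
onset_steps = [200, 500, 1000, 2500, 6000]
grok_steps = [2.0 * s ** 1.2 for s in onset_steps]

alpha, c, r2 = fit_power_law(onset_steps, grok_steps)
print(f"alpha={alpha:.3f}  c={c:.3f}  R^2={r2:.4f}")
```

An exponent above 1 in such a fit corresponds to the super-linear scaling described in the captions: the gap between defect onset and grokking grows faster than proportionally as runs get longer.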

Theorems & Definitions (4)

Definition 1: Execution manifold
Definition 2: Commutator defect
Definition 3: Invariance measure
Definition 4: Transverse decoupling
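
The abstract describes the commutator defect as a curvature measure derived from non-commuting gradient updates. The following toy sketch shows one way such a quantity could be computed: apply gradient steps for two batches in both orders and measure how far apart the resulting parameters end up. The normalization (dividing by the product of the two step lengths) and the two quadratic toy losses are assumptions made for illustration; the paper's Definition 2 is not reproduced on this page and may differ.

```python
import math

def grad_a(theta):
    # Gradient of the toy loss L_A(x, y) = 0.5 * (x**2 + 4 * y**2)
    x, y = theta
    return (x, 4.0 * y)

def grad_b(theta):
    # Gradient of the toy loss L_B(x, y) = 0.5 * (x + y)**2
    x, y = theta
    return (x + y, x + y)

def sgd_step(theta, grad_fn, lr):
    # One plain gradient-descent step on a single "batch" loss.
    g = grad_fn(theta)
    return tuple(t - lr * gi for t, gi in zip(theta, g))

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def commutator_defect(theta, grad_1, grad_2, lr):
    """Gap between applying updates in order (1, 2) versus (2, 1).

    Assumed normalization: divide by the product of the two step
    lengths at theta, which makes the result independent of lr for
    quadratic losses.
    """
    theta_12 = sgd_step(sgd_step(theta, grad_1, lr), grad_2, lr)
    theta_21 = sgd_step(sgd_step(theta, grad_2, lr), grad_1, lr)
    gap = norm(tuple(a - b for a, b in zip(theta_12, theta_21)))
    scale = (lr * norm(grad_1(theta))) * (lr * norm(grad_2(theta)))
    return gap / scale if scale > 0 else 0.0

theta0 = (1.0, 2.0)
defect_ab = commutator_defect(theta0, grad_a, grad_b, lr=0.1)
defect_aa = commutator_defect(theta0, grad_a, grad_a, lr=0.1)
print("different batches:", defect_ab)
print("same batch twice: ", defect_aa)
```

For quadratic losses the gap equals the learning rate squared times the norm of H_B g_A - H_A g_B, where H and g are each batch's Hessian and gradient. The defect therefore vanishes when the two updates commute (for example, when both batches are identical) and is nonzero when curvature makes the order of updates matter.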
