Grokking From Abstraction to Intelligence

Junjie Zhang, Zhen Shen, Gang Xiong, Xisong Dong

Abstract

Grokking in modular arithmetic has become the quintessential "fruit fly" experiment for investigating the mechanistic origins of model generalization. Existing research, however, focuses narrowly on specific local circuits or on optimization tuning, and largely overlooks the global structural evolution that drives the phenomenon. We propose that grokking originates from a spontaneous simplification of the model's internal structure, governed by the principle of parsimony. Combining causal, spectral, and algorithmic complexity measures with Singular Learning Theory, we reveal that the transition from memorization to generalization corresponds to the physical collapse of redundant manifolds and to deep information compression. This offers a new perspective on the mechanisms of model overfitting and generalization.

Paper Structure

This paper contains 27 sections, 10 equations, 16 figures, and 4 tables.

Figures (16)

  • Figure 1: Grokking on the modular arithmetic task $(x-y)\bmod 97$.
  • Figure 2: CMA for architectural evolution during emergence and ablation experiment for attention heads. Sub-figures (a/b/c/d) represent modular arithmetic tasks $(x \circ y)\bmod 97$ with $\circ\in\{+,-,\times,\div\}$.
  • Figure 3: The emergence of generalization is mechanistically characterized by (a) the transition to a group-theoretic structure and (b) the spectral localization of weight energy in the Fourier domain.
  • Figure 4: Global BDM complexity of the model parameters over the course of training. During the emergence phase, the algorithmic complexity of the model weights drops sharply and the internal parameter structure becomes noticeably block-structured; colors indicate the discretized parameter values after quantization.
  • Figure 5: Left: Generalization (green) aligns with a collapse of geometric-complexity proxies (red). Right: Spectral weights transition from high-entropy noise to sparse, low-dimensional structured patterns via the Occam Gate (diagonal-like structure is most directly expected for addition/subtraction; other operations may require reindexing such as discrete-log ordering).
  • ...and 11 more figures
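
For orientation, the tasks named in the captions above, $(x \circ y)\bmod 97$ with $\circ\in\{+,-,\times,\div\}$, and the discrete-log reindexing mentioned for Figure 5 can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the function names (`make_table`, `find_primitive_root`, `discrete_log_table`) are hypothetical, and skipping $y = 0$ for division is an assumption. It requires Python 3.8+ for `pow(y, -1, p)`.

```python
P = 97  # prime modulus taken from the figure captions


def make_table(op: str, p: int = P) -> dict:
    """Return {(x, y): (x op y) mod p} for op in '+', '-', '*', '/'."""
    table = {}
    for x in range(p):
        for y in range(p):
            if op == "+":
                table[(x, y)] = (x + y) % p
            elif op == "-":
                table[(x, y)] = (x - y) % p
            elif op == "*":
                table[(x, y)] = (x * y) % p
            elif op == "/":
                if y == 0:
                    continue  # assumption: undefined pairs are skipped
                # Division is multiplication by the modular inverse of y.
                table[(x, y)] = (x * pow(y, -1, p)) % p
            else:
                raise ValueError(f"unsupported operation: {op!r}")
    return table


def find_primitive_root(p: int = P) -> int:
    """Smallest g whose powers enumerate every nonzero residue mod p."""
    for g in range(2, p):
        seen, v = set(), 1
        for _ in range(p - 1):
            v = (v * g) % p
            seen.add(v)
        if len(seen) == p - 1:
            return g
    raise ValueError("no primitive root found (is p prime?)")


def discrete_log_table(p: int = P) -> dict:
    """Map each nonzero residue x to k such that g**k == x (mod p)."""
    g = find_primitive_root(p)
    log, v = {}, 1
    for k in range(p - 1):
        log[v] = k
        v = (v * g) % p
    return log


if __name__ == "__main__":
    sub = make_table("-")
    print(sub[(3, 5)])  # (3 - 5) mod 97
    log = discrete_log_table()
    # Under discrete-log ordering, multiplication mod p becomes
    # addition mod (p - 1) on the exponents.
    x, y = 12, 34
    print(log[(x * y) % P] == (log[x] + log[y]) % (P - 1))
```

Reordering the nonzero residues by their discrete logarithm turns multiplication mod 97 into addition mod 96, which is one way to read the caption's remark that operations other than addition and subtraction may need reindexing before a diagonal-like pattern appears.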