Grokking as a Falsifiable Finite-Size Transition

Yuda Bi, Chenyu Zhang, Qiheng Wang, Vince D Calhoun

Abstract

Grokking -- the delayed onset of generalization after early memorization -- is often described with phase-transition language, but that claim has lacked falsifiable finite-size inputs. Here we supply those inputs by treating the group order $p$ of $\mathbb{Z}_p$ as an admissible extensive variable and a held-out spectral head-tail contrast as a representation-level order parameter, then apply a condensed-matter-style diagnostic chain to coarse-grid sweeps and a dense near-critical addition audit. Binder-like crossings reveal a shared finite-size boundary, and susceptibility comparison strongly disfavors a smooth-crossover interpretation ($\Delta\mathrm{AIC}=16.8$ in the near-critical audit). Phase-transition language in grokking can therefore be tested as a quantitative finite-size claim rather than invoked as analogy alone, although the transition order remains unresolved at present.

Paper Structure

This paper contains 28 sections, 10 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Raw spectral order-parameter curves for the canonical addition task. Panel A shows $m_{\mathrm{HTC}}$ versus training fraction $f$ for all $13$ coarse-grid primes; Panel B zooms in on the transition region around the shared crossing in this first diagnostic step. The ordering structure sharpens monotonically with $p$.
  • Figure 2: Binder-based diagnostics for the canonical addition task. Panel A: Binder-like cumulant curves cross near a common fraction. Panel B: all pairwise crossing estimates are shown, with the selected dominant branch highlighted for the drift fit; that branch shows no significant linear drift versus inverse size. Panel C: the coarse-grid Binder minimum is continuity-leaning but method-dependent, motivating the near-critical stress test in Fig. 3A. Panel D: susceptibility model comparison favors power-law scaling over the minimal saturating null, disfavoring a smooth-crossover interpretation.
  • Figure 3: Near-critical follow-up for the addition task. (A) Binder-minimum extrapolation on both grids. The coarse grid (open symbols, $p \leq 251$) extrapolates toward zero; the near-critical grid (filled symbols, $p$ up to $397$) develops a negative trend. (B) Seed-level HTC distributions at representative near-critical points for the two largest primes ($p=307, f=0.42$ and $p=397, f=0.40$). Both distributions remain unimodal and do not provide clean evidence of first-order coexistence despite the negative Binder minimum at the largest size.
  • Figure 4: Order-parameter screening summary for context. Panel A reports time-resolved scans over common observables. Panel B compares spectral candidates and documents that HTC is retained as the representation-level choice for the finite-size diagnostic chain, with screening treated as validation rather than selection.
  • Figure 5: Robustness of the HTC head--tail cutoff in near-critical and coarse-grid data. Panel A shows the near-critical pairwise crossing behavior across $k=3,5,10$; Panel B shows the same for Binder-minimum extrapolation. $k=3$ and $k=5$ remain aligned, while $k=10$ lowers head--bulk separation at large sizes.
  • ...and 1 more figure
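
The Binder-based diagnostics named in the captions above can be illustrated with a minimal sketch. It assumes the standard Binder cumulant $U = 1 - \langle m^4 \rangle / (3 \langle m^2 \rangle^2)$ computed over seed-level order-parameter samples; the paper calls its quantity "Binder-like", so its exact definition may differ. The function names `binder_cumulant` and `crossing_fraction` and the linear-interpolation crossing rule are illustrative assumptions, not the authors' implementation.

```python
from statistics import fmean


def binder_cumulant(samples):
    """Standard Binder cumulant U = 1 - <m^4> / (3 <m^2>^2).

    `samples` holds seed-level order-parameter values at one (p, f) point.
    Assumption: the paper's "Binder-like" cumulant may use another form.
    """
    m2 = fmean(x ** 2 for x in samples)
    m4 = fmean(x ** 4 for x in samples)
    return 1.0 - m4 / (3.0 * m2 ** 2)


def crossing_fraction(fractions, u_small, u_large):
    """Estimate the training fraction where two Binder curves cross.

    The curves are sampled on the same grid `fractions`; the first sign
    change of their difference is located by linear interpolation.
    Returns None when the curves do not cross on the grid.
    """
    diffs = [a - b for a, b in zip(u_small, u_large)]
    for i in range(len(diffs) - 1):
        d0, d1 = diffs[i], diffs[i + 1]
        if d0 == 0.0:
            return fractions[i]  # curves touch exactly on a grid point
        if d0 * d1 < 0.0:
            t = d0 / (d0 - d1)  # fractional position of the zero
            return fractions[i] + t * (fractions[i + 1] - fractions[i])
    if diffs[-1] == 0.0:
        return fractions[-1]
    return None
```

Applying `crossing_fraction` to every pair of system sizes would give a set of pairwise crossing estimates of the kind described for Figure 2, Panel B. How the dominant branch is selected and how the drift fit is performed are not specified here, so the sketch leaves them out.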