
Grokking as Dimensional Phase Transition in Neural Networks

Ping Wang

Abstract

Neural network grokking -- the abrupt memorization-to-generalization transition -- challenges our understanding of learning dynamics. Through finite-size scaling of gradient avalanche dynamics across eight model scales, we find that grokking is a dimensional phase transition: effective dimensionality $D$ crosses from sub-diffusive (subcritical, $D < 1$) to super-diffusive (supercritical, $D > 1$) at generalization onset, exhibiting self-organized criticality (SOC). Crucially, $D$ reflects gradient field geometry, not network architecture: synthetic i.i.d. Gaussian gradients maintain $D \approx 1$ regardless of graph topology, while real training exhibits dimensional excess from backpropagation correlations. The grokking-localized $D(t)$ crossing -- robust across topologies -- offers new insight into the trainability of overparameterized networks.
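
To make the finite-size scaling (FSS) estimate concrete, here is a minimal sketch of extracting the exponent $D$ as the slope of $\log s_{\max}$ against $\log N$. Only the relation $s_{\max} \sim N^D$ comes from the paper; the scale list (beyond the reported endpoints $N = 81$--$2001$), the lognormal noise, and all variable names are illustrative assumptions standing in for measured avalanche data.

```python
import numpy as np

# Illustrative stand-in data: eight model scales spanning the paper's
# reported range N = 81--2001 (intermediate values are assumptions), with
# per-scale maximum avalanche sizes generated to follow s_max ~ N^D, D = 1.
rng = np.random.default_rng(0)
N = np.array([81, 121, 201, 401, 601, 1001, 1501, 2001])
s_max = 2.0 * N ** 1.0 * rng.lognormal(0.0, 0.05, size=N.size)

# Finite-size scaling posits s_max ~ N^D, so D is the slope in log-log space.
D, log_c = np.polyfit(np.log(N), np.log(s_max), deg=1)

# R^2 of the linear fit in log-log coordinates (the paper reports R^2 = 1.00).
resid = np.log(s_max) - (D * np.log(N) + log_c)
r2 = 1.0 - resid.var() / np.log(s_max).var()
print(f"D = {D:.2f}, R^2 = {r2:.2f}")
```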


Figures (3)

  • Figure 1: Grokking as transient SOC and dimensional phase transition. (a) Training (blue) and evaluation (purple) accuracies for a representative XOR case ($h = 21$, $N = 85$; train and evaluation share the same four patterns), showing a synchronized abrupt transition at epoch 27. Inset: multi-scale analysis across $h = 20$--$500$ reveals scale-dependent grokking timing spanning epochs 12--134. (b) Time-resolved FSS analysis shows that effective dimensionality $D$ evolves continuously during training. Yellow region: multi-scale grokking window. Orange line: single-scale grokking. Red line: time-averaged $D = 1.00 \pm 0.02$. Inset: FSS fit quality $R^2 > 0.98$. (c) Representative example: weight concentration (Gini coefficient of $|\boldsymbol{\theta}|$, teal; a computation sketch follows this list) exhibits a transient peak coinciding with grokking. Multi-seed statistical validation (1000 seeds) is described in the text.
  • Figure 2: Finite-size scaling analysis of avalanche dynamics. (a) Complementary cumulative distributions (CCDFs) of avalanche sizes across eight model scales ($h = 20$--$500$), showing heavy-tailed, scale-dependent behavior with systematic cutoff growth. (b) X-only data collapse: plotting $P(>s)$ vs $s/N^D$ collapses all scales toward a common curve using a single exponent $D$, validating the FSS exponent without additional fitting parameters (a collapse sketch follows this list). (c) FSS of maximum ($s_{\max} \sim N^D$, left axis) and mean ($\langle s \rangle \sim N^\gamma$, right axis) avalanche sizes, yielding $D = 1.00 \pm 0.02$ ($R^2 = 1.00$) and $\gamma = 1.15 \pm 0.06$ ($R^2 = 0.99$) across eight scales.
  • Figure 3: Gradient geometry determines dimensionality. (a) Bootstrap distributions (10,000 resamples) of the FSS exponent $D$, where each run is phase-split at its own grokking epoch: pre-grokking real gradients (green, $D = 0.90 \pm 0.02$, sub-diffusive), post-grokking real gradients (red, $D = 1.20 \pm 0.02$, super-diffusive), and synthetic i.i.d. Gaussian gradients (blue, $D = 0.99 \pm 0.01$). Three non-overlapping peaks confirm statistically distinct scaling regimes (the resampling loop is sketched after this list). (b) Leave-one-out FSS analysis: removing any single scale preserves $D$, confirming scale invariance across $N = 81$--$2001$. Inset: five network topologies collapse to $D \approx 0.99$ for synthetic gradients, demonstrating topology invariance.
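
Figure 1(c) tracks weight concentration through the Gini coefficient of the parameter magnitudes $|\boldsymbol{\theta}|$. The paper does not spell out its estimator, so the sketch below assumes the standard sorted-values formula; the function name and example inputs are hypothetical.

```python
import numpy as np

def gini(theta: np.ndarray) -> float:
    """Gini coefficient of |theta|: 0 for perfectly even magnitudes,
    approaching 1 when magnitude concentrates in a few parameters."""
    x = np.sort(np.abs(theta).ravel())  # ascending magnitudes
    n = x.size
    i = np.arange(1, n + 1)             # 1-based ranks
    # Standard formula: G = 2 * sum(i * x_i) / (n * sum(x_i)) - (n + 1) / n
    return float(2.0 * np.sum(i * x) / (n * np.sum(x)) - (n + 1.0) / n)

rng = np.random.default_rng(0)
print(gini(np.ones(1000)))          # ~0.0: perfectly even weights
print(gini(rng.normal(size=1000)))  # roughly 0.4 for Gaussian weights
```

A transient rise in this statistic during training corresponds to the teal peak that Figure 1(c) reports at the grokking epoch.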
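
Figure 2(b)'s x-only collapse can be reproduced in miniature: rescale avalanche sizes by $N^D$ and overlay the empirical CCDFs. The truncated power-law generator, the exponent $\tau$, and the scale list below are assumptions chosen only to make the sketch self-contained.

```python
import numpy as np
import matplotlib.pyplot as plt

def ccdf(samples: np.ndarray):
    """Empirical complementary CDF, P(S >= s), at the sorted sample points."""
    s = np.sort(samples)
    p = np.arange(s.size, 0, -1) / s.size
    return s, p

def powerlaw_with_cutoff(rng, tau, s_c, n):
    """Inverse-transform samples from P(s) ~ s^(-tau) on [1, s_c]."""
    u = rng.random(n)
    return (1.0 - u * (1.0 - s_c ** (1.0 - tau))) ** (1.0 / (1.0 - tau))

rng = np.random.default_rng(0)
D, tau = 1.0, 1.5  # D from the FSS fit; tau is an assumed avalanche exponent
for N in [81, 201, 601, 2001]:                   # illustrative scales
    sizes = powerlaw_with_cutoff(rng, tau, s_c=N ** D, n=5000)
    s, p = ccdf(sizes)
    plt.loglog(s / N ** D, p, label=f"N = {N}")  # x-only rescaling s -> s/N^D
plt.xlabel(r"$s / N^D$"); plt.ylabel(r"$P(>s)$"); plt.legend()
plt.show()
```

If the single exponent $D$ is right, the rescaled curves fall onto a common master curve, which is exactly the collapse criterion the figure uses.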
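
Figure 3(a)'s bootstrap distributions and Figure 3(b)'s leave-one-out check can both be sketched with the same refitting loop. The resampling unit (whole runs), the synthetic data, and every name below are assumptions; only the procedure (resample, refit $D$, repeat 10,000 times; drop one scale, refit) mirrors what the captions describe.

```python
import numpy as np

def fit_D(N, s_max):
    """FSS exponent: slope of log s_max versus log N."""
    return np.polyfit(np.log(N), np.log(s_max), deg=1)[0]

# Illustrative stand-in: 50 runs at eight scales, with s_max ~ N^D for D = 1.
rng = np.random.default_rng(1)
N = np.array([81, 121, 201, 401, 601, 1001, 1501, 2001])
s_max_runs = 2.0 * N ** 1.0 * rng.lognormal(0.0, 0.1, size=(50, N.size))

# Bootstrap (Figure 3a): resample runs with replacement, refit D each time.
Ds = np.empty(10_000)
for b in range(Ds.size):
    idx = rng.integers(0, s_max_runs.shape[0], size=s_max_runs.shape[0])
    Ds[b] = fit_D(N, s_max_runs[idx].mean(axis=0))
print(f"D = {Ds.mean():.2f} +/- {Ds.std():.2f}")

# Leave-one-out (Figure 3b): drop each scale in turn; D should barely move.
mean_smax = s_max_runs.mean(axis=0)
for k in range(N.size):
    keep = np.arange(N.size) != k
    print(f"without N = {N[k]:>4}: D = {fit_D(N[keep], mean_smax[keep]):.2f}")
```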