Grokking vs. Learning: Same Features, Different Encodings
Dmitry Manning-Coe, Jacopo Gliozzi, Alexander G. Stapleton, Edward Hirst, Giuseppe De Tomasi, Barry Bradlyn, David S. Berman
TL;DR
This work probes whether grokking and ordinary steady training yield fundamentally different models by comparing their learned features, compressibility, and training dynamics on two interpretable tasks (Ising-model phase classification and modular addition). It shows that both paths converge on the same features (energy and magnetization for Ising; Fourier coefficients for modular addition), but that the encodings of those features differ: a novel compressive regime appears only in steady learning and enables substantial model compression. The authors introduce information-geometric measures based on the Fisher Information Metric to analyze training trajectories, uncovering task-dependent pre-grokking dynamics and, in at least one task, a near-straight-line trajectory in information space for grokking. These findings illuminate when grokking offers practical benefits and suggest a framework for interpreting and compressing models via geometry-driven diagnostics.
Abstract
Grokking typically achieves similar loss to ordinary, "steady", learning. We ask whether these different learning paths, grokking versus ordinary training, lead to fundamental differences in the learned models. To do so, we compare the features, compressibility, and learning dynamics of models trained via each path in two tasks. We find that grokked and steadily trained models learn the same features, but there can be large differences in the efficiency with which these features are encoded. In particular, we find a novel "compressive regime" of steady training in which a linear trade-off emerges between model loss and compressibility, and which is absent in grokking. In this regime, we can achieve compression factors of 25x relative to the base model, and 5x the compression achieved in grokking. We then track how model features and compressibility develop through training. We show that model development in grokking is task-dependent, and that peak compressibility is achieved immediately after the grokking plateau. Finally, novel information-geometric measures are introduced which demonstrate that models undergoing grokking follow a straight path in information space.
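
As a concrete illustration of the two features named in the TL;DR for the Ising task, the minimal sketch below computes energy and magnetization per site for a single spin configuration. It is not the authors' code. The function name `ising_features`, the square 2D lattice, the coupling J = 1, the absence of an external field, and the periodic boundary conditions are all assumptions made for illustration; the summary above does not specify the lattice or Hamiltonian used.

```python
def ising_features(spins):
    """Return (energy per site, magnetization per site) for a lattice of +/-1 spins.

    Illustrative assumptions (not taken from the paper): 2D lattice,
    nearest-neighbour coupling J = 1, no external field, periodic boundaries.
    """
    n_rows = len(spins)
    n_cols = len(spins[0])
    energy = 0
    total_spin = 0
    for i in range(n_rows):
        for j in range(n_cols):
            s = spins[i][j]
            total_spin += s
            # Count each bond once: right and down neighbours, wrapping around.
            energy -= s * spins[i][(j + 1) % n_cols]
            energy -= s * spins[(i + 1) % n_rows][j]
    n_sites = n_rows * n_cols
    return energy / n_sites, total_spin / n_sites


# Two hand-built 4x4 configurations: fully aligned and alternating spins.
ordered = [[1] * 4 for _ in range(4)]
checkerboard = [[1 if (i + j) % 2 == 0 else -1 for j in range(4)] for i in range(4)]

print(ising_features(ordered))       # fully aligned lattice
print(ising_features(checkerboard))  # alternating lattice
```

Under these assumptions, the aligned lattice has the lowest possible energy per site and maximal magnetization, while the alternating lattice has zero magnetization, which is why these two scalars are natural candidates for distinguishing phases.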
