Grokking Explained: A Statistical Phenomenon
Breno W. Carvalho, Artur S. d'Avila Garcez, Luís C. Lamb, Emílio Vital Brazil
TL;DR
This work investigates grokking, the delayed generalization phenomenon in which test performance suddenly improves after training loss has converged, and argues that distribution shift between training and test data is the key driver. It introduces two synthetic class-hierarchy datasets (equidistant-subclass and equivariant-subclass) for controlled distribution shifts, and validates the mechanism via MNIST clustering in a ResNet latent space. Across MLP and Transformer architectures, the results show that grokking is largely architecture-agnostic and can arise with dense data and minimal hyperparameter tuning. The study provides benchmarks and insights into relational data structure, advocates stopping criteria that account for late generalization, and aims to guide future research on sample-efficient, relationally informed learning.
Abstract
Grokking, or delayed generalization, is an intriguing learning phenomenon in which test set loss decreases sharply only after a model's training set loss has converged. This challenges the conventional understanding of training dynamics in deep learning. In this paper, we formalize and investigate grokking, highlighting that a key factor in its emergence is a distribution shift between training and test data. We introduce two synthetic datasets specifically designed to analyze grokking. One dataset examines the impact of limited sampling, and the other investigates the role of transfer learning in grokking. By inducing distribution shifts through controlled imbalanced sampling of sub-categories, we systematically reproduce the phenomenon, demonstrating that while small sampling is strongly associated with grokking, it is not its cause. Instead, small sampling serves as a convenient mechanism for achieving the necessary distribution shift. We also show that when classes form an equivariant map, grokking can be explained by the model's ability to learn from similar classes or sub-categories. Unlike earlier work suggesting that grokking primarily arises from high regularization and sparse data, we demonstrate that it can also occur with dense data and minimal hyperparameter tuning. Our findings deepen the understanding of grokking and pave the way for developing better stopping criteria in future training processes.
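The idea of inducing a distribution shift through imbalanced sampling of sub-categories can be illustrated with a minimal sketch. This is not the paper's dataset: the Gaussian subclass layout, the sampling weights, and the helper names `sample_subclass_counts` and `make_split` are all hypothetical choices made for illustration. The training split over-samples one subclass per class, the test split samples subclasses uniformly, and the class priors stay roughly equal in both.

```python
import numpy as np

def sample_subclass_counts(n_total, weights):
    """Split n_total samples across subclasses in proportion to weights."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    counts = np.floor(weights * n_total).astype(int)
    # Give any rounding remainder to the most heavily weighted subclass.
    counts[np.argmax(weights)] += n_total - counts.sum()
    return counts

def make_split(centers, subclass_to_class, counts, rng, noise=0.1):
    """Draw Gaussian points around each subclass center, labeled by parent class."""
    xs, ys, subs = [], [], []
    for sub, (center, n) in enumerate(zip(centers, counts)):
        xs.append(center + noise * rng.standard_normal((n, center.shape[0])))
        ys.append(np.full(n, subclass_to_class[sub]))
        subs.append(np.full(n, sub))
    return np.concatenate(xs), np.concatenate(ys), np.concatenate(subs)

rng = np.random.default_rng(0)

# Hypothetical layout: 2 classes, each with 2 subclasses, in 2-D.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
subclass_to_class = np.array([0, 0, 1, 1])

# Training data under-samples subclasses 1 and 3; test data is balanced.
train_counts = sample_subclass_counts(400, [0.45, 0.05, 0.45, 0.05])
test_counts = sample_subclass_counts(400, [0.25, 0.25, 0.25, 0.25])

X_train, y_train, s_train = make_split(centers, subclass_to_class, train_counts, rng)
X_test, y_test, s_test = make_split(centers, subclass_to_class, test_counts, rng)

# Compare subclass frequencies; total variation distance quantifies the shift.
train_freq = np.bincount(s_train, minlength=4) / len(s_train)
test_freq = np.bincount(s_test, minlength=4) / len(s_test)
tv_distance = 0.5 * np.abs(train_freq - test_freq).sum()

print("train subclass frequencies:", train_freq)
print("test subclass frequencies: ", test_freq)
print("total variation distance:  ", tv_distance)
```

In this sketch the amount of shift is set by the sampling weights rather than by the total number of training samples, so it can be varied with the dataset size held fixed. That mirrors the abstract's point that small sampling is one way to produce the shift rather than its cause.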
