Exploring Grokking: Experimental and Mechanistic Investigations
Hu Qiye, Zhou Hao, Yu RuoXi
TL;DR
This work investigates grokking, the late-emerging generalization observed in over-parameterized networks, through controlled experiments on modular addition with varying data fractions, model types, and optimization settings. It finds that grokking is most pronounced around a training data fraction of $\alpha \approx 0.5$ for a $p=97$ modular arithmetic task. Transformer models show the phenomenon, while MLPs and LSTMs do not under the tested encoding, and weight decay via AdamW substantially improves generalization. The authors synthesize two mechanisms from prior literature, structured representations and implicit biases, as drivers of grokking, and discuss a Goldilocks zone of hyperparameters that fosters the emergence of robust, generalizable representations. The study provides insights into training dynamics and data efficiency, suggesting how initialization, regularization, and encoding choices can influence when and how quickly generalization occurs. Overall, the work advances understanding of the conditions under which grokking arises and how to harness or mitigate it in practical training regimes.
Abstract
The phenomenon of grokking in over-parameterized neural networks has garnered significant interest. The network first memorizes the training set, reaching zero training error while its test error stays near random; prolonged further training then leads to a sharp transition from no generalization to perfect generalization. Our study combines extensive experiments with a review of prior research on the mechanism of grokking. Through the experiments, we gain insight into how grokking depends on the training data fraction, the model, and the optimization. Researchers have proposed various viewpoints on the mechanism of grokking, and we introduce some of these perspectives.
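
To make the experimental setup concrete, the following is a minimal sketch of how a modular addition dataset and its train/test split might be constructed. It assumes the task is $(a + b) \bmod p$ over all ordered pairs with $p = 97$, and that the training set is a uniformly random fraction $\alpha$ of those pairs. The function name, the seed, and the splitting procedure are illustrative assumptions, not the authors' code.

```python
import random


def make_modular_addition_split(p=97, alpha=0.5, seed=0):
    """Build all (a, b, (a + b) mod p) examples and split them at random.

    p     -- modulus of the task (97 in the setting summarized above)
    alpha -- fraction of all p * p pairs assigned to the training set
    seed  -- hypothetical seed, used only to make the split reproducible
    """
    # Enumerate every ordered pair (a, b) together with its label.
    examples = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]

    # Shuffle with a local generator so the global random state is untouched.
    rng = random.Random(seed)
    rng.shuffle(examples)

    # The first alpha fraction becomes the training set, the rest the test set.
    n_train = int(alpha * len(examples))
    return examples[:n_train], examples[n_train:]


train, test = make_modular_addition_split(p=97, alpha=0.5)
print(len(train), len(test))
```

Sweeping `alpha` over a range of values is one way to reproduce the kind of data-fraction comparison described above; how the pairs are encoded as model inputs is left open here, since the encoding is one of the choices the study examines.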
