Exploring Grokking: Experimental and Mechanistic Investigations

Hu Qiye, Zhou Hao, Yu RuoXi

TL;DR

This work investigates grokking, the delayed generalization observed in over-parameterized networks, through controlled experiments on modular addition with varying training data fractions, model types, and optimization settings. For the $p=97$ modular arithmetic task, grokking is most pronounced around a training data fraction of $\alpha \approx 0.5$. Transformer models exhibit the phenomenon, whereas MLPs and LSTMs do not under the tested encoding, and weight decay via AdamW substantially improves generalization. The authors synthesize two mechanisms from prior literature, structured representations and implicit biases, as drivers of grokking, and discuss a Goldilocks zone of hyperparameters that fosters robust, generalizable representations. The study offers insights into training dynamics and data efficiency, suggesting how initialization, regularization, and encoding choices influence when and how quickly generalization occurs. Overall, the work clarifies the conditions under which grokking arises and how to harness or mitigate it in practical training regimes.

Abstract

Grokking in over-parameterized neural networks has attracted significant interest. The network first memorizes the training set, reaching zero training error while test error stays near chance; prolonged further training then produces a sharp transition from no generalization to perfect generalization. Our study combines extensive experiments with a survey of research on the mechanism behind grokking. The experiments give insight into how grokking depends on the training data fraction, the model, and the optimization. Researchers have proposed various accounts of the mechanism, and we introduce some of these perspectives.
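
As a concrete illustration of the experimental setup, the sketch below enumerates every equation of the modular addition task for $p=97$ and splits it at random into training and validation sets using a training data fraction $\alpha$. The function name `make_modular_addition_split`, the `seed` argument, and the rounding of the split size are assumptions made for this example; the source does not specify how the split was implemented.

```python
import random


def make_modular_addition_split(p=97, alpha=0.5, seed=0):
    """Return (train, val) lists of (x, y, (x + y) mod p) triples.

    Hypothetical helper: the name, seed handling, and rounding rule
    are illustrative assumptions, not taken from the paper.
    """
    # Enumerate all p * p input pairs together with their labels.
    examples = [(x, y, (x + y) % p) for x in range(p) for y in range(p)]

    # Shuffle deterministically so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(examples)

    # A fraction alpha of the examples is used for training;
    # the remainder is held out for validation.
    n_train = int(alpha * len(examples))
    return examples[:n_train], examples[n_train:]


train, val = make_modular_addition_split(p=97, alpha=0.5)
print(len(train), len(val))  # 4704 4705
```

With $p=97$ there are only $97^2 = 9409$ distinct equations, so $\alpha$ directly controls how much of the full operation table the model sees during training.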

Paper Structure

This paper contains 16 sections, 3 equations, 8 figures.

Figures (8)

  • Figure 1: Comparison of accuracy at different training data fractions with mod 97. Blue lines show training accuracy; red lines show validation accuracy.
  • Figure 2: Comparison of accuracy at different training data fractions. Blue lines show training accuracy; red lines show validation accuracy.
  • Figure 3: Comparison of accuracy at different training data fractions with the MLP model. Blue lines show training accuracy; red lines show validation accuracy.
  • Figure 4: Comparison of accuracy at different training data fractions with the LSTM model. Blue lines show training accuracy; red lines show validation accuracy.
  • Figure 5: Different optimization algorithms lead to different amounts of generalization within an optimization budget of 1800 steps for the problem of (x+y) mod 97. Weight decay (i.e., AdamW) improves generalization the most, but some generalization occurs even with full-batch optimizers and with models trained without weight decay or activation noise when the training data fraction is high. Suboptimal hyperparameter choices severely limit generalization. Note that high training accuracy is reached after approximately 100-200 updates for all optimization methods and training data fractions.
  • ...and 3 more figures
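
The Figure 5 caption credits weight decay via AdamW with the largest gain in generalization. As a rough sketch of what distinguishes AdamW from Adam with an L2 penalty, the single-parameter update below implements both variants following the standard published update rules. The function name `adam_step`, the `decoupled` flag, and all hyperparameter values are assumptions chosen for illustration; they are not the settings used in the experiments.

```python
import math


def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=True):
    """One Adam update for a scalar weight w with gradient g at step t >= 1.

    decoupled=True  -> AdamW: the decay shrinks w directly.
    decoupled=False -> Adam + L2: the decay is added to the gradient
                       and then rescaled by the adaptive step size.
    Hypothetical helper written for illustration only.
    """
    if not decoupled:
        # L2 penalty folded into the gradient before the moment estimates.
        g = g + weight_decay * w

    # Exponential moving averages of the gradient and its square.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g

    # Bias correction for the zero-initialized moments.
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)

    if decoupled:
        # Decoupled weight decay: independent of the gradient statistics.
        w = w - lr * weight_decay * w

    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# With a zero loss gradient, AdamW shrinks the weight by lr * weight_decay * w,
# while Adam + L2 takes a normalized step of size close to lr.
w_adamw, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.01,
                          weight_decay=0.1, decoupled=True)
w_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.01,
                       weight_decay=0.1, decoupled=False)
print(w_adamw, w_l2)
```

In the L2 variant the penalty passes through Adam's per-parameter normalization, so its effective strength depends on the gradient history; the decoupled variant applies the same proportional shrinkage to every weight.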