Grokking Explained: A Statistical Phenomenon

Breno W. Carvalho, Artur S. d'Avila Garcez, Luís C. Lamb, Emílio Vital Brazil

TL;DR

This work investigates grokking, the delayed generalization phenomenon where test performance suddenly improves after training loss has converged, and argues that distribution shift between training and test data is the key driver. It introduces two synthetic class-hierarchy datasets—equidistant-subclass and equivariant-subclass—for controlled distribution shifts, and validates the mechanism via MNIST clustering in a ResNet latent space. Across MLP and Transformer architectures, results show grokking is largely architecture-agnostic and can arise with dense data and minimal hyperparameter tuning. The study provides benchmarks and insights into relational data structure, advocating new stopping criteria that account for late generalization and guiding future research on sample-efficient, relationally informed learning.

Abstract

Grokking, or delayed generalization, is an intriguing learning phenomenon where test set loss decreases sharply only after a model's training set loss has converged. This challenges conventional understanding of the training dynamics in deep learning networks. In this paper, we formalize and investigate grokking, highlighting that a key factor in its emergence is a distribution shift between training and test data. We introduce two synthetic datasets specifically designed to analyze grokking. One dataset examines the impact of limited sampling, and the other investigates transfer learning's role in grokking. By inducing distribution shifts through controlled imbalanced sampling of sub-categories, we systematically reproduce the phenomenon, demonstrating that while small-sampling is strongly associated with grokking, it is not its cause. Instead, small-sampling serves as a convenient mechanism for achieving the necessary distribution shift. We also show that when classes form an equivariant map, grokking can be explained by the model's ability to learn from similar classes or sub-categories. Unlike earlier work suggesting that grokking primarily arises from high regularization and sparse data, we demonstrate that it can also occur with dense data and minimal hyper-parameter tuning. Our findings deepen the understanding of grokking and pave the way for developing better stopping criteria in future training processes.
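The controlled imbalanced sampling of sub-categories mentioned above can be illustrated with a small sketch. It is not the authors' dataset code: the function names, the `keep_fraction` parameter, and the toy layout of four subclasses are assumptions made for illustration, and total variation distance is used here only as one convenient way to quantify the resulting train/test shift.

```python
import numpy as np

def imbalanced_subclass_indices(subclass_labels, keep_fraction, rng):
    """Select training indices, keeping a chosen fraction of each subclass.

    subclass_labels: 1-D int array with one subclass id per sample.
    keep_fraction:   dict {subclass id: fraction in [0, 1]}; subclasses that
                     are not listed are kept in full (hypothetical knob).
    """
    chosen = []
    for sub in np.unique(subclass_labels):
        idx = np.flatnonzero(subclass_labels == sub)
        n_keep = int(round(keep_fraction.get(int(sub), 1.0) * idx.size))
        chosen.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(chosen))

def subclass_proportions(subclass_labels, n_subclasses):
    """Empirical distribution over subclass ids."""
    counts = np.bincount(subclass_labels, minlength=n_subclasses)
    return counts / counts.sum()

rng = np.random.default_rng(0)

# Toy layout (assumed): subclasses 0 and 1 belong to class 0, 2 and 3 to class 1.
subclass = np.repeat(np.arange(4), 250)

# Under-sample one subclass per class in the training set only.
train_idx = imbalanced_subclass_indices(subclass, {1: 0.02, 3: 0.02}, rng)

train_p = subclass_proportions(subclass[train_idx], 4)
test_p = subclass_proportions(subclass, 4)  # test set keeps the full distribution

# Total variation distance between the train and test subclass distributions.
tv_shift = 0.5 * np.abs(train_p - test_p).sum()
print(f"train size = {train_idx.size}, subclass shift (TV) = {tv_shift:.3f}")
```

The class labels stay balanced in this toy setup (255 training samples per class), so the shift is entirely at the subclass level. The same shift could be produced with a much larger dataset, in line with the claim that small sampling is a mechanism for the shift rather than the cause of grokking.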

Paper Structure

This paper contains 20 sections, 3 equations, 4 figures.

Figures (4)

  • Figure 1: Schematic representation of the two synthetic datasets used in this study. The equidistant subclass dataset (a) and the equivariant subclass dataset (b).
  • Figure 2: Grokking example: assume that the training set performance has converged ($T_R$, orange curve). There is an inflection point $\alpha_0$ where the model's test set performance ($T_E$, blue curve) escapes a low performing region. The interval between $\alpha_0$ and $\alpha_1$ is the transition phase during which the test set performance transitions to a region much closer to that of the training set performance. By $\Delta_S$, we denote the difference between the upper limit of the low performing region $\sigma_S$ and the lower limit of the high performing region $\sigma_E$. $\Delta_S$ is used to indicate how much grokking happened in this training process. By $\Delta_T$, we denote the difference between the limit value of $T_R$ and $\sigma_E$. $\Delta_T$ can be used to indicate how much room for improvement there is in the learning process. The interval $S$ is the support interval. The longer this interval is without the value of $T_E$ dropping below $\sigma_E$, the more stable the grokking phenomenon.
  • Figure 3: Average test set curves after training has converged to a very high training set accuracy on the equidistant dataset (dataset 1). We examine the effect of small sample sizes (\ref{fig:balanced_data_varying_sample_size}), the effect of an extreme data shift compared with keeping a few samples from the original distribution (\ref{fig:small_unbalance_balance_eq}), and finally the occurrence of grokking with larger sample sizes, which had previously been assumed not to occur (\ref{fig:medium_unbalance_balance}).
  • Figure 4: Investigation of grokking in the equivariant dataset (dataset 2). Even without any samples from specific subclasses (Figure \ref{fig:equivariant-no-sample}), the model was able to generalize here, unlike in the results shown in Fig. \ref{fig:small_unbalance_balance_eq} for the equidistant dataset (dataset 1). This must be due to the similarity, i.e. the proximity in representation space, between the subclasses of a class in dataset 2. In Figure \ref{fig:equivariant-few-samples}, the expected grokking behavior appears, driven both by the very small sample size, which shifts the distribution, and by the corresponding subclass imbalance. As before, a small sample size increases grokking intensity, but grokking is still present with a larger sample size.
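
The quantities named in the Figure 2 caption ($\alpha_0$, $\alpha_1$, $\Delta_S$, $\Delta_T$, and the support interval $S$) can be read off a logged test curve. The sketch below is one possible operationalization, not the paper's formal Definition 3.1. It assumes a performance metric where higher is better (for example accuracy), takes the thresholds $\sigma_S$ and $\sigma_E$ as user-supplied inputs, and takes $\alpha_1$ as the first step at which the test curve reaches $\sigma_E$. The function name `grokking_summary` and the example curve are hypothetical.

```python
import numpy as np

def grokking_summary(test_perf, train_limit, sigma_s, sigma_e):
    """Summarize a test curve using the quantities from the Figure 2 caption.

    test_perf:   test performance per evaluation step (higher is better).
    train_limit: converged value of the training performance T_R.
    sigma_s:     upper limit of the low-performing region (assumed given).
    sigma_e:     lower limit of the high-performing region (assumed given).
    Returns None if the curve never reaches the high-performing region.
    """
    test_perf = np.asarray(test_perf, dtype=float)

    high = np.flatnonzero(test_perf >= sigma_e)
    if high.size == 0:
        return None
    alpha_1 = int(high[0])  # first step inside the high-performing region

    # alpha_0: last step before alpha_1 still inside the low-performing region.
    low_before = np.flatnonzero(test_perf[:alpha_1] <= sigma_s)
    alpha_0 = int(low_before[-1]) if low_before.size else 0

    # Support: consecutive steps from alpha_1 that stay at or above sigma_e.
    below = np.flatnonzero(test_perf[alpha_1:] < sigma_e)
    support = int(below[0]) if below.size else int(test_perf.size - alpha_1)

    return {
        "alpha_0": alpha_0,
        "alpha_1": alpha_1,
        "delta_s": sigma_e - sigma_s,      # how much grokking happened
        "delta_t": train_limit - sigma_e,  # remaining room for improvement
        "support": support,                # stability of the grokked regime
    }

# Made-up test accuracy curve, recorded after training accuracy has converged.
curve = [0.50, 0.52, 0.51, 0.53, 0.70, 0.88, 0.91, 0.92, 0.93, 0.92]
summary = grokking_summary(curve, train_limit=1.0, sigma_s=0.55, sigma_e=0.90)
print(summary)
```

A stopping criterion that accounts for late generalization could monitor such a summary instead of halting as soon as the training loss plateaus. How to choose $\sigma_S$ and $\sigma_E$ in practice is left open here.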

Theorems & Definitions (1)

  • Definition 3.1: Grokking