When Data Falls Short: Grokking Below the Critical Threshold

Vaibhav Singh; Eugene Belilovsky; Rahaf Aljundi

TL;DR

The paper investigates grokking under data-scarce regimes and distribution shift, challenging the idea that grokking requires data above a critical size $D_{\text{crit}}$. It demonstrates that knowledge distillation (KD) from a grokked teacher can induce, accelerate, and transfer grokking to new distributions, even when data is below $D_{\text{crit}}$. In joint training on multiple distributions, KD from grokked teachers enables a larger model to generalize across distributions where supervised learning fails. In continual pretraining, KD mitigates forgetting and preserves prior knowledge while enabling rapid adaptation to new data, offering a practical path for deploying adaptable models under limited data.

Abstract

In this paper, we investigate the phenomenon of grokking, where models exhibit delayed generalization following overfitting on training data. We focus on data-scarce regimes where the number of training samples falls below the critical threshold, making grokking unobservable, and on practical scenarios involving distribution shift. We first show that Knowledge Distillation (KD) from a model that has already grokked on a distribution (p1) can induce and accelerate grokking on a different distribution (p2), even when the available data lies below the critical threshold. This highlights the value of KD for deployed models that must adapt to new distributions under limited data. We then study training on the joint distribution (p1, p2) and demonstrate that while standard supervised training fails when either distribution has insufficient data, distilling from models grokked on the individual distributions enables generalization. Finally, we examine a continual pretraining setup, where a grokked model transitions from p1 to p2, and find that KD both accelerates generalization and mitigates catastrophic forgetting, achieving strong performance even with only 10% of the data. Together, our results provide new insights into the mechanics of grokking under knowledge transfer and underscore the central role of KD in enabling generalization in low-data and evolving distribution settings.
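
To make the distillation objective concrete, the following is a minimal sketch of a per-example KD loss, assuming the common temperature-scaled soft-target formulation (KL divergence from teacher to student, mixed with cross-entropy on the hard label). The abstract does not state the exact loss used, so the function names, the mixing weight `alpha`, and the `temperature` value are illustrative assumptions, not the authors' implementation.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Assumed KD objective: alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE.

    `alpha` and `temperature` are hypothetical hyperparameters chosen for
    illustration; the source does not specify them.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    # Soft-target term: KL divergence from the (grokked) teacher to the student.
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_p_teacher, log_p_student))
    # Hard-label term: standard cross-entropy at temperature 1.
    ce = -log_softmax(student_logits)[label]
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


# Example: a student that disagrees with the teacher on a 3-class problem.
loss = kd_loss([0.2, 0.1, 0.0], [0.0, 0.5, 4.0], label=2)
```

In this sketch the teacher would be the model already grokked on p1 (or on the individual distributions), and the student would be trained on the scarce p2 data; setting `alpha=0` recovers plain supervised training.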
