When Data Falls Short: Grokking Below the Critical Threshold

Vaibhav Singh, Eugene Belilovsky, Rahaf Aljundi

TL;DR

The paper investigates grokking under data-scarce regimes and distribution shift, challenging the idea that grokking requires data above a critical size $D_{\text{crit}}$. It demonstrates that knowledge distillation from a grokked teacher can induce, accelerate, and transfer grokking to new distributions, even when data is below $D_{\text{crit}}$. In joint training on multiple distributions, KD from grokked teachers enables a larger model to generalize across distributions where supervised learning fails. In continual pretraining, KD mitigates forgetting and preserves prior knowledge while enabling rapid adaptation to new data, offering a practical path for deploying adaptable models under limited data.

Abstract

In this paper, we investigate the phenomenon of grokking, where models exhibit delayed generalization following overfitting on training data. We focus on data-scarce regimes where the number of training samples falls below the critical threshold, making grokking unobservable, and on practical scenarios involving distribution shift. We first show that Knowledge Distillation (KD) from a model that has already grokked on a distribution (p1) can induce and accelerate grokking on a different distribution (p2), even when the available data lies below the critical threshold. This highlights the value of KD for deployed models that must adapt to new distributions under limited data. We then study training on the joint distribution (p1, p2) and demonstrate that while standard supervised training fails when either distribution has insufficient data, distilling from models grokked on the individual distributions enables generalization. Finally, we examine a continual pretraining setup, where a grokked model transitions from p1 to p2, and find that KD both accelerates generalization and mitigates catastrophic forgetting, achieving strong performance even with only 10% of the data. Together, our results provide new insights into the mechanics of grokking under knowledge transfer and underscore the central role of KD in enabling generalization in low-data and evolving distribution settings.
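
As a rough illustration of the distillation signal the abstract refers to, the sketch below computes a KL divergence between softened teacher and student output distributions for a single example. The abstract does not state the exact loss, temperature, or any weighting against cross-entropy, so the KL form, the `temperature` parameter, and the names `log_softmax` and `kd_loss` are assumptions made for illustration only.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits (hypothetical helper)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_z for z in scaled]

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumed form of the distillation objective; the source does not
    specify the loss or the temperature.
    """
    log_p_t = log_softmax(teacher_logits, temperature)
    log_p_s = log_softmax(student_logits, temperature)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))

# Toy logits: a confident teacher and an untrained (uniform) student.
teacher = [4.0, 0.0, 0.0]
student = [0.0, 0.0, 0.0]
loss_before = kd_loss(student, teacher)
loss_matched = kd_loss(teacher, teacher)
```

The loss is zero when the student reproduces the teacher's distribution and positive otherwise, so minimizing it pulls the student toward the grokked teacher's outputs without using ground-truth labels.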

Paper Structure

This paper contains 8 sections, 4 equations, 7 figures.

Figures (7)

  • Figure 1: \ref{fig:scheme_11} shows that $f_{S}$ groks below the critical data size when trained via KD from a grokked model $f_{T}$, whereas training from scratch fails. In \ref{fig:scheme_22}, a larger model $f_{M}$ trained jointly on $p_1$ and $p_2$ cannot generalize when either dataset is below the threshold. Distilling from the smaller grokked models $f_{S}$ and $f_{T}$, however, enables $f_{M}$ to grok and generalize effectively even under scarce data.
  • Figure 2: \ref{fig:vg107} shows the effectiveness of KD irrespective of the optimizer choice for both the modular addition and subtraction tasks. \ref{fig:vg113} shows typical grokking on distribution $p_2$ with $30\%$ of the training data (fraction denoted $f$), without KD. We observe that weight decay helps in showing grokking, but it is not the only underlying cause. When trained with Adam, grokking is not observed for either task within 15,000 iterations. This concurs with Power et al. (2022). However, \ref{fig:vg107} demonstrates that a student model trained on a different distribution $p_2$ with the same fraction, but now with KD from the teacher model trained on $p_1$, displays accelerated generalization irrespective of the optimizer choice.
  • Figure 3: Dashed lines in \ref{fig:Addition_107_kd} and \ref{fig:subtraction_107_kd} show typical grokking on $p_{2}$ ($P=107$) with different training fractions ($f$). Training from scratch below $30\%$ shows no grokking. With KD from a model grokked on $p_{1}$ ($P=113$), grokking is accelerated and occurs with as little as $25\%$ of $p_{2}$. Distillation is applied to probability outputs from the operator token, enabling generic operator-level representations rather than $P$-specific ones.
  • Figure 4: \ref{fig:without_kd_single} demonstrates that it is impossible to observe grokking when the data fraction goes below a certain critical threshold ($20\%$), even with twice the iterations (30,000). In such a case, the model does not learn anything regardless of the optimizer. In \ref{fig:with_kd_single}, it can be clearly seen that with KD, grokking is observed for all tasks, even without weight decay. However, we notice that weight decay helps achieve better generalization.
  • Figure 5: Performance comparison of training strategies for a larger transformer model $f_{M}$ on $p_1$ ($35\%$) and different fractions $(0.35, 0.3, 0.25)$ of $p_2$. In the joint training regime (\ref{fig:without_kd}), the model fails to generalize via cross-entropy when data from either distribution falls below the critical threshold. In contrast, training solely with distillation enables grokking even with $25\%$ of $p_2$ (\ref{fig:with_kd}). At this low fraction, generalization does not reach unity due to the imperfect $f_{p_2}$ trained under data scarcity, while for the $0.35$ and $0.3$ fractions, generalization is rapid with no grokking.
  • ...and 2 more figures
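
The captions above refer to modular addition and subtraction tasks with moduli $P=113$ (for $p_1$) and $P=107$ (for $p_2$), trained on a fraction $f$ of the data. The sketch below shows one plausible way such a setup could be built. It assumes each distribution consists of all pairs $(a, b)$ with $0 \le a, b < P$ and that the training set is a random subset of size $\lfloor f P^2 \rfloor$; the function names, the seed, and the split procedure are hypothetical and not taken from the paper.

```python
import random

def modular_dataset(P, op="add"):
    """All (a, b, label) triples for a op b mod P (assumed task layout)."""
    if op == "add":
        return [(a, b, (a + b) % P) for a in range(P) for b in range(P)]
    if op == "sub":
        return [(a, b, (a - b) % P) for a in range(P) for b in range(P)]
    raise ValueError(f"unsupported op: {op}")

def split_by_fraction(examples, f, seed=0):
    """Shuffle with a fixed seed and keep a fraction f for training."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(f * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

p1 = modular_dataset(113, op="add")   # teacher distribution
p2 = modular_dataset(107, op="add")   # student distribution
p2_train, p2_val = split_by_fraction(p2, f=0.25)
```

Under these assumptions, $f = 0.25$ on $P = 107$ leaves 2,862 of the 11,449 possible pairs for training, which gives a sense of how little data the student sees in the low-fraction experiments.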