Grokked Models are Better Unlearners

Yuanbang Liang, Yang Li

TL;DR

This work establishes a systematic link between grokking and machine unlearning, showing that models post-grokking possess modular, disentangled representations that enable more efficient and stable forgetting with less collateral damage across vision and language domains. By comparing pre- and post-grokking checkpoints across multiple unlearning algorithms, the study demonstrates improved forgetting efficiency, higher retention/test performance, and reduced gradient overlap between forgetting and retention. Mechanistic analyses reveal that grokking produces lower gradient correlations, simpler local representations, and greater representational disentanglement (e.g., reduced CKA), which collectively explain the superior unlearning capabilities. The findings imply that training dynamics promoting grokking can serve as a practical, orthogonal lever to enhance privacy-preserving unlearning without modifying unlearning algorithms, with broad relevance to real-world regulatory and security considerations.
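As a rough illustration of the CKA measurement mentioned above, the following is a minimal sketch of linear CKA between two feature matrices, using the standard formulation $\|X^\top Y\|_F^2 / (\|X^\top X\|_F \, \|Y^\top Y\|_F)$ on column-centered features. The function names and the random toy features are assumptions made for this example; the summary does not say which CKA variant or which layers the authors used.

```python
import math
import random

def center_columns(m):
    """Subtract each column's mean from a row-major matrix (list of rows)."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for matrices a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(u * v for u, v in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA between two sets of features computed on the same n examples."""
    x, y = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(x, y)  # ||X^T Y||_F^2
    denominator = math.sqrt(cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y))
    return numerator / denominator

# Hypothetical toy features: 200 examples, 3 feature dimensions each.
random.seed(0)
feats_a = [[random.gauss(0, 1) for _ in range(3)] for _ in range(200)]
feats_b = [[random.gauss(0, 1) for _ in range(3)] for _ in range(200)]

print(linear_cka(feats_a, feats_a))  # identical features give a value of about 1.0
print(linear_cka(feats_a, feats_b))  # independent features give a much smaller value
```

In this reading, a lower CKA between the representations of forget and retain data would indicate that the two subsets occupy less similar feature subspaces, which is the sense of "disentanglement" the summary refers to.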

Abstract

Grokking, i.e., delayed generalization that emerges well after a model has fit the training data, has been linked to robustness and representation quality. We ask whether this training regime also helps with machine unlearning, i.e., removing the influence of specified data without full retraining. We compare applying standard unlearning methods before versus after the grokking transition across vision (CNNs/ResNets on CIFAR, SVHN, and ImageNet) and language (a transformer on a TOFU-style setup). Starting from grokked checkpoints consistently yields (i) more efficient forgetting (fewer updates to reach a target forget level), (ii) less collateral damage (smaller drops on retained and test performance), and (iii) more stable updates across seeds, relative to early-stopped counterparts under identical unlearning algorithms. Analyses of features and curvature further suggest that post-grokking models learn more modular representations with reduced gradient alignment between forget and retain subsets, which facilitates selective forgetting. Our results highlight when a model is trained (pre- vs. post-grokking) as an orthogonal lever to how unlearning is performed, providing a practical recipe to improve existing unlearning methods without altering their algorithms.
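One way to make "gradient alignment between forget and retain subsets" concrete is the cosine similarity between the mean loss gradients of the two subsets. The sketch below does this for a toy logistic-regression model; the model, the helper names, and the data points are assumptions chosen for illustration and are not taken from the paper, which works with CNNs, ResNets, and a transformer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_logistic_grad(w, data):
    """Mean gradient of the logistic loss over (features, label) pairs, label in {0, 1}."""
    grad = [0.0] * len(w)
    for x, y in data:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        for i, xi in enumerate(x):
            grad[i] += err * xi / len(data)
    return grad

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def forget_retain_alignment(w, forget, retain):
    """Cosine similarity between the mean forget gradient and the mean retain gradient."""
    return cosine(mean_logistic_grad(w, forget), mean_logistic_grad(w, retain))

# Hypothetical toy data: the forget and retain examples use different features,
# so their gradients touch different weights and the alignment is zero.
w = [0.0, 0.0]
forget = [([1.0, 0.0], 1)]
retain = [([0.0, 1.0], 1)]
print(forget_retain_alignment(w, forget, retain))
```

An alignment near zero means an update computed on the forget set has little first-order effect on the retain loss, whereas an alignment near 1 means raising the forget loss also raises the retain loss. This is the intuition behind treating low alignment as favorable for selective forgetting.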

Paper Structure

This paper contains 47 sections, 2 theorems, 20 equations, 2 figures, 12 tables.

Key Result

Theorem D.1

Under Assumptions 1-4, for two randomly sampled data points $x$ and $x'$, the expected gradient correlation is:

Figures (2)

  • Figure 1: Grokking Enables Superior Machine Unlearning. (a) Training Dynamics: ResNet training trajectory on CIFAR-10 showing the grokking phenomenon. The model initially learns (pink region) until conventional early stopping at $\theta_{\text{pre}}$ (gray line), then overfits with declining test accuracy. After extended training, the model suddenly "groks", achieving delayed generalization with dramatically improved test accuracy (blue region) at $\theta_{\text{grok}}$. (b) Unlearning Performance Analysis: Unlearning effectiveness using gradient ascent measured across different training checkpoints. Note that higher Unlearning Accuracy (UA) indicates worse unlearning (the model still remembers what it should forget). During early training (pink region), UA exhibits a concerning upward trend with high volatility: as the model learns, it progressively memorizes forget examples more deeply, creating increasingly entangled representations. Critically, UA remains close to Retain Accuracy (RA), indicating that unlearning algorithms cannot effectively distinguish between forget and retain data due to highly entangled representations. However, after grokking (blue region), a dramatic separation emerges: while RA remains high (preserving useful knowledge), UA drops significantly below RA and stabilizes. This large gap between RA and UA demonstrates that grokked models enable selective forgetting: the model can effectively "forget" target data while preserving retained knowledge. This selectivity, combined with the reduced volatility, shows that grokking fundamentally reorganizes representations into a more modular, disentangled structure that enables reliable and precise unlearning operations.
  • Figure 2: Efficiency Advantages Depend on Task Difficulty. Convergence dynamics of $\nabla\tau$ unlearning on CNN (CIFAR-10) comparing grokked ($\theta_{\text{grok}}$) and pre-grokked ($\theta_{\text{pre}}$) models. (a) At moderate forget rates (15%), grokked models show substantial efficiency gains, achieving effective forgetting in 5-8 steps vs. 15-20 for pre-grokked models. (b) At challenging forget rates (50%), efficiency advantages become marginal, though grokked models still maintain more stable convergence.
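The quantities in these captions (gradient-ascent unlearning, UA, RA, and the number of steps needed to forget) can be sketched on a toy model. The example below is a minimal illustration, assuming a logistic-regression classifier, UA defined as accuracy on the forget set after unlearning, and RA as accuracy on the retain set; the helper names, learning rate, and data are hypothetical and do not reproduce the paper's models or its $\nabla\tau$ method.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def mean_logistic_grad(w, data):
    """Mean gradient of the logistic loss over (features, label) pairs."""
    grad = [0.0] * len(w)
    for x, y in data:
        err = sigmoid(logit(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += err * xi / len(data)
    return grad

def accuracy(w, data):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum(1 for x, y in data if (1 if logit(w, x) > 0 else 0) == y)
    return hits / len(data)

def gradient_ascent_unlearn(w, forget, lr=1.0, target_ua=0.0, max_steps=100):
    """Ascend the forget-set loss until forget accuracy (UA) reaches target_ua.

    Returns the updated weights and the number of update steps taken.
    """
    w = list(w)  # copy so the caller's weights are not modified
    steps = 0
    while accuracy(w, forget) > target_ua and steps < max_steps:
        grad = mean_logistic_grad(w, forget)
        w = [wi + lr * gi for wi, gi in zip(w, grad)]  # ascent: move up the loss
        steps += 1
    return w, steps

# Hypothetical setup: retain data depends only on feature 0,
# and the forget example depends only on feature 1.
retain = [([1.0, 0.0], 1), ([-1.0, 0.0], 0)]
forget = [([0.0, 1.0], 1)]
w_start = [2.0, 2.0]

w_after, steps = gradient_ascent_unlearn(w_start, forget)
print("steps:", steps)
print("UA:", accuracy(w_after, forget), "RA:", accuracy(w_after, retain))
```

Because the forget and retain examples here rely on separate weights, the ascent steps lower UA without touching the weight that the retain set depends on. If the forget example shared feature 0 with the retain data, the same procedure would also degrade RA, which mirrors the entangled pre-grokking regime the captions describe.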

Theorems & Definitions (3)

  • Theorem D.1: Pairwise Gradient Correlation
  • Proof D.1
  • Corollary D.2: Aggregate Gradient Correlation