A Systematic Empirical Study of Grokking: Depth, Architecture, Activation, and Regularization

Shalima Binta Manir, Anamika Paul Rupa

Abstract

Grokking, the delayed transition from memorization to generalization in neural networks, remains poorly understood, in part because prior empirical studies confound the roles of architecture, optimization, and regularization. We present a controlled study that systematically disentangles these factors on modular addition (mod 97), with matched and carefully tuned training regimes across models. Our central finding is that grokking dynamics are not primarily determined by architecture, but by interactions between optimization stability and regularization. Specifically, we show: (1) \textbf{depth has a non-monotonic effect}, with depth-4 MLPs consistently failing to grok while depth-8 residual networks recover generalization, demonstrating that depth requires architectural stabilization; (2) \textbf{the apparent gap between Transformers and MLPs largely disappears} (1.11$\times$ delay) under matched hyperparameters, indicating that previously reported differences are largely due to optimizer and regularization confounds; (3) \textbf{activation function effects are regime-dependent}, with GELU up to 4.3$\times$ faster than ReLU only when regularization permits memorization; and (4) \textbf{weight decay is the dominant control parameter}, exhibiting a narrow ``Goldilocks'' regime in which grokking occurs, while too little or too much prevents generalization. Across 3--5 seeds per configuration, these results provide a unified empirical account of grokking as an interaction-driven phenomenon. Our findings challenge architecture-centric interpretations and clarify how optimization and regularization jointly govern delayed generalization.
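For concreteness, the benchmark task referenced throughout is modular addition: given a pair $(a, b)$, predict $(a + b) \bmod 97$. The sketch below shows one way such a dataset and train/test split could be constructed; the function name make_mod_add_dataset, the 50% split fraction, and the random split are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

P = 97  # modulus for the modular addition task (mod 97, as in the paper)

def make_mod_add_dataset(p=P, train_frac=0.5, seed=0):
    """Enumerate all (a, b) pairs with label (a + b) mod p and split them.

    The 50% train fraction and random split are illustrative assumptions;
    the paper's exact data fraction is not specified in this excerpt.
    """
    pairs = np.array([(a, b) for a in range(p) for b in range(p)], dtype=np.int64)
    labels = (pairs[:, 0] + pairs[:, 1]) % p
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(pairs))
    n_train = int(train_frac * len(pairs))
    train_idx, test_idx = perm[:n_train], perm[n_train:]
    return (pairs[train_idx], labels[train_idx]), (pairs[test_idx], labels[test_idx])

(train_x, train_y), (test_x, test_y) = make_mod_add_dataset()
print(train_x.shape, test_x.shape)  # (4704, 2) (4705, 2) for a 50% split of 97*97 pairs
```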

Paper Structure

This paper contains 32 sections, 7 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Canonical grokking on modular addition mod 97 (direct experiment output). Train accuracy (blue) reaches 100% at step $T_\text{train} = 1{,}000$, while test accuracy (orange) remains near chance until a sharp transition at step $T_\text{grok} = 33{,}000$ (grokking delay $= 32{,}000$ steps). Test accuracy stabilizes at $\approx$99.35% and continues to slowly improve through step 400,000. The dashed line marks the 99% threshold.
  • Figure 2: H1: Depth-8 residual MLP grokking delay across 5 seeds. Seeds 3 and 4 (orange) did not grok within the 400,000-step budget (shown as 0); seeds 0--2 grokked with a mean $\Delta T = 33{,}333 \pm 12{,}858$ steps (dashed line).
  • Figure 3: H2: Transformer vs. MLP grokking comparison over 5 seeds (corrected hyperparameter configs). (a) Per-seed grokking delays $\Delta T = T_\text{test} - T_\text{train}$ (computed as in the sketch after this list); dashed lines are per-architecture means. (b) Mean delay $\pm$ std: MLP achieves $45.6\text{k} \pm 5.6\text{k}$ vs. Transformer $50.8\text{k} \pm 22.6\text{k}$, a 1.11$\times$ difference with 4.1$\times$ higher variance. (c) $T_\text{train}$ vs. $T_\text{test}$ scatter: both architectures memorize rapidly but diverge in generalization onset.
  • Figure 4: H3: Activation function comparison at two hyperparameter configurations. Left (Sweep A): original H3 config (lr $= 10^{-2}$, $\lambda = 2\times10^{-3}$, width 256). GELU fails entirely (0/5 seeds); ReLU and Tanh grok in 2/5 and 3/5 seeds with large delays. Right (Sweep B): H2 MLP baseline config (lr $= 3\times10^{-2}$, $\lambda = 5\times10^{-4}$, width 512). GELU dominates with 5/5 seeds at 45.6k steps---4.32$\times$ faster than ReLU. The reversal between sweeps demonstrates an activation--weight decay interaction: GELU's advantage only emerges when regularization is light enough to permit memorization.
  • Figure 5: H4: Weight decay sweeps for MLP (SGD) and Transformer (AdamW), plus fair comparison at optimal $\lambda$ each (5 seeds). (a) MLP $\lambda$ sweep: optimal at $\lambda = 10^{-3}$; $\lambda \geq 2\times10^{-3}$ prevents memorization entirely. (b) Transformer $\lambda$ sweep: optimal at $\lambda = 5.0$, requiring 5,000$\times$ stronger regularization than the MLP. (c) Fair comparison at optimal $\lambda$ each (5 seeds): MLP achieves $26.8\text{k} \pm 6.4\text{k}$ vs. Transformer $50.8\text{k} \pm 38.7\text{k}$, a 1.90$\times$ gap---the definitive architecture comparison superseding H2's 1.11$\times$.
  • ...and 3 more figures
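For reference, the grokking delay plotted in these figures is the gap between the first step at which train accuracy crosses a threshold and the first step at which test accuracy does (Figure 1 uses a 99% threshold). The following is a minimal sketch of that computation from logged accuracy curves; the data layout (parallel sequences of steps and accuracies) is an assumption of this sketch, not the paper's logging code.

```python
def first_crossing(steps, accuracies, threshold=0.99):
    """Return the first logged step at which accuracy reaches `threshold`, or None."""
    for step, acc in zip(steps, accuracies):
        if acc >= threshold:
            return step
    return None  # never crossed within the training budget (e.g. H1 seeds 3 and 4)

def grokking_delay(steps, train_acc, test_acc, threshold=0.99):
    """Grokking delay Delta T = T_test - T_train at the given accuracy threshold."""
    t_train = first_crossing(steps, train_acc, threshold)
    t_test = first_crossing(steps, test_acc, threshold)
    if t_train is None or t_test is None:
        return None  # run failed to memorize or to generalize within budget
    return t_test - t_train
```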