Why Do You Grok? A Theoretical Analysis of Grokking Modular Addition
Mohamad Amin Mohamadi, Zhiyuan Li, Lei Wu, Danica J. Sutherland
TL;DR
The paper investigates why grokking occurs in modular addition by characterizing a transition from the kernel regime to the rich (feature-learning) regime during gradient descent. It shows that permutation-equivariant kernel methods cannot generalize unless the training set covers a constant fraction of all possible data points, whereas regularized two-layer quadratic networks can generalize from far fewer samples once they leave the kernel regime. The authors establish both lower and upper bounds: lower bounds on kernel-regime generalization in regression and classification, and rich-regime guarantees for networks with small $\ell_\infty$ norm (including margin-based PAC-Bayes bounds) that enable generalization with $\tilde{\mathcal{O}}(p^2)$ data for regression and $\tilde{\mathcal{O}}(p^{5/3})$ for classification. Alongside these theoretical results, they give a general framework for population loss lower bounds and empirical evidence, including for Transformer-like models, that supports the view of grokking as a delayed transition from kernel-dominated behavior to feature-learning dynamics. The work deepens the understanding of grokking and suggests regularization-based mechanisms for inducing earlier generalization in overparameterized models.
Abstract
We present a theoretical explanation of the "grokking" phenomenon, where a model generalizes long after overfitting, for the originally studied problem of modular addition. First, we show that early in gradient descent, when the "kernel regime" approximately holds, no permutation-equivariant model can achieve small population error on modular addition unless it sees at least a constant fraction of all possible data points. Eventually, however, models escape the kernel regime. We show that two-layer quadratic networks that achieve zero training loss with bounded $\ell_{\infty}$ norm generalize well with substantially fewer training points, and further show that such networks exist and can be found by gradient descent with small $\ell_{\infty}$ regularization. We further provide empirical evidence that these networks, as well as simple Transformers, leave the kernel regime only after initially overfitting. Taken together, our results strongly support the case for grokking as a consequence of the transition from kernel-like behavior to limiting behavior of gradient descent on deep networks.
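To make the objects in the abstract concrete, the following is a minimal NumPy sketch of the modular addition task and a two-layer network with quadratic activation, together with the $\ell_\infty$ norm of its parameters. It is an illustration under stated assumptions, not the authors' implementation: the concatenated one-hot input encoding, the parameterization $f(x) = V (W x)^2$, the Gaussian initialization, and all function names (`modular_addition_dataset`, `quadratic_net`, `linf_norm`, `init_params`) are hypothetical choices made here.

```python
import numpy as np

def modular_addition_dataset(p):
    """Return all p**2 pairs (a, b) with labels (a + b) mod p.

    Assumption: each input is the concatenation of one-hot(a) and one-hot(b),
    so inputs live in R^{2p}.
    """
    a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    a, b = a.ravel(), b.ravel()
    rows = np.arange(p * p)
    x = np.zeros((p * p, 2 * p))
    x[rows, a] = 1.0        # one-hot encoding of a in the first p coordinates
    x[rows, p + b] = 1.0    # one-hot encoding of b in the last p coordinates
    y = (a + b) % p
    return x, y

def init_params(p, width, scale=0.1, seed=0):
    """Assumed Gaussian initialization; width and scale are illustrative."""
    rng = np.random.default_rng(seed)
    W = scale * rng.standard_normal((width, 2 * p))  # first layer
    V = scale * rng.standard_normal((p, width))      # second layer
    return W, V

def quadratic_net(params, x):
    """Two-layer network with quadratic activation: f(x) = V (W x)^2."""
    W, V = params
    hidden = (x @ W.T) ** 2   # elementwise square of the pre-activations
    return hidden @ V.T       # one output score per residue class

def linf_norm(params):
    """Largest absolute entry over all parameters (the l_inf norm)."""
    return max(np.abs(w).max() for w in params)

p = 7
x, y = modular_addition_dataset(p)
params = init_params(p, width=32)
scores = quadratic_net(params, x)
accuracy = np.mean(scores.argmax(axis=1) == y)  # untrained, so near chance
```

Training on a random subset of the `p * p` rows and tracking accuracy on the remaining rows is the setting in which the sample-size results above are stated; an $\ell_\infty$ penalty would use `linf_norm(params)` as the regularizer.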
