Grokking in Linear Estimators -- A Solvable Model that Groks without Understanding
Noam Levi, Alon Beck, Yohai Bar-Sinai
TL;DR
The paper analyzes grokking in a minimal linear estimator by solving the exact gradient-flow dynamics of a linear teacher–student model with Gaussian inputs. It shows that delayed generalization can arise purely from covariance-driven dynamics: the grokking time is determined primarily by the ratio of input dimension to training-set size, $\lambda = \frac{d_{\mathrm{in}}}{N_{\mathrm{tr}}}$, and is modulated by initialization, output dimension, and weight decay, with no qualitative shift to "understanding" required. The authors extend the analysis semi-analytically to 2-layer linear networks and provide evidence that some predictions persist under certain nonlinear activations in an NTK-like regime. Overall, the work offers an interpretable framework that links dataset statistics to learning dynamics and clarifies how accuracy thresholds can mislead interpretations of grokking in neural networks.
Abstract
Grokking is the intriguing phenomenon where a model learns to generalize long after it has fit the training data. We show both analytically and numerically that grokking can surprisingly occur in linear networks performing linear tasks in a simple teacher-student setup with Gaussian inputs. In this setting, the full training dynamics is derived in terms of the training and generalization data covariance matrix. We present exact predictions on how the grokking time depends on input and output dimensionality, train sample size, regularization, and network initialization. We demonstrate that the sharp increase in generalization accuracy may not imply a transition from "memorization" to "understanding", but can simply be an artifact of the accuracy measure. We provide empirical verification for our calculations, along with preliminary results indicating that some predictions also hold for deeper networks with non-linear activations.
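The setup described in the abstract can be sketched numerically. The following is a minimal illustration, not the authors' code: it trains a linear student on a linear teacher with Gaussian inputs using plain gradient descent (a discrete stand-in for gradient flow). The function names `simulate_linear_teacher_student` and `first_crossing`, all hyperparameter values, and the accuracy definition (the fraction of samples whose squared error falls below a threshold `eps`) are assumptions made here for illustration only.

```python
import numpy as np


def simulate_linear_teacher_student(d_in=50, n_train=100, n_test=1000, d_out=1,
                                    lr=0.01, steps=2000, eps=1e-3, seed=0):
    """Gradient descent on the MSE loss of a linear student S against a linear teacher T.

    All defaults are illustrative assumptions, not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    # Gaussian inputs with identity covariance (assumed).
    X_tr = rng.standard_normal((n_train, d_in))
    X_te = rng.standard_normal((n_test, d_in))
    # Random teacher and random student initialization (assumed scaling).
    T = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
    S = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)

    hist = {"train_loss": [], "gen_loss": [], "train_acc": [], "test_acc": []}
    for _ in range(steps):
        D = S - T  # student-teacher mismatch
        err_tr = np.mean((X_tr @ D) ** 2, axis=1)  # per-sample training error
        err_te = np.mean((X_te @ D) ** 2, axis=1)  # per-sample held-out error
        hist["train_loss"].append(err_tr.mean())
        # Expected per-sample error for unit-covariance inputs: ||D||_F^2 / d_out.
        hist["gen_loss"].append(np.sum(D ** 2) / d_out)
        # Assumed accuracy measure: fraction of samples with error below eps.
        hist["train_acc"].append(np.mean(err_tr < eps))
        hist["test_acc"].append(np.mean(err_te < eps))
        # Gradient of the training MSE with respect to S.
        grad = 2.0 * X_tr.T @ (X_tr @ D) / (n_train * d_out)
        S = S - lr * grad
    return {k: np.array(v) for k, v in hist.items()}


def first_crossing(acc, level=0.95):
    """Index of the first step where acc >= level, or None if it never happens."""
    hits = np.flatnonzero(np.asarray(acc) >= level)
    return int(hits[0]) if hits.size else None


if __name__ == "__main__":
    h = simulate_linear_teacher_student()
    print("train accuracy first reaches 95% at step:", first_crossing(h["train_acc"]))
    print("test accuracy first reaches 95% at step:", first_crossing(h["test_acc"]))
```

Both losses decay smoothly in this sketch, so any abrupt jump in the accuracy curves comes from thresholding the per-sample error rather than from a change in the underlying dynamics. Varying `d_in / n_train` (the ratio $\lambda$) or the initialization scale is a simple way to explore how the gap between the two crossing steps changes.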
