Grokking in Linear Models for Logistic Regression
Nataraj Das, Atreya Vedantam, Chandrashekar Lakshminarayanan
TL;DR
The paper shows that grokking, a delayed generalization phenomenon, can emerge in a simple linear classifier trained with logistic loss when a learnable bias is present. It develops a three-phase theory of learning in which the weight vector follows the trajectory $\mathbf w(t)=\hat{\mathbf w}\log t+\rho(t)$ while the bias $b(t)$ evolves slowly, driven by the support vectors (SVs); this delays generalization, especially under margin-concentrated or adversarial test distributions. The authors provide explicit bounds linking the grokking time to SV imbalance and margin sensitivity, and validate the theory with synthetic experiments in 2D and in high dimensions, including projected gradient descent (PGD) attacks. The results demonstrate that grokking does not require depth or representation learning; it arises instead from implicit bias and late-time optimization dynamics, with practical implications for robustness under distribution shift.
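To make the role of the bias concrete, the following is a minimal sketch of the standard logistic-regression objective with a bias term and its plain gradient-descent update for $b$. The notation ($n$ training pairs $(\mathbf x_i, y_i)$ with $y_i\in\{-1,+1\}$, step size $\eta$, sigmoid $\sigma$, margin $m_i$) is assumed here for illustration and may differ from the paper's own formulation.

$$
\mathcal L(\mathbf w, b)=\sum_{i=1}^{n}\log\!\bigl(1+e^{-m_i}\bigr),
\qquad m_i = y_i\bigl(\mathbf w^\top \mathbf x_i + b\bigr),
$$

$$
\frac{\partial \mathcal L}{\partial b} = -\sum_{i=1}^{n} y_i\,\sigma(-m_i),
\qquad
b_{t+1} = b_t + \eta\sum_{i=1}^{n} y_i\,\sigma(-m_i),
\qquad \sigma(z)=\frac{1}{1+e^{-z}}.
$$

Each example pushes $b$ toward its own label with weight $\sigma(-m_i)$, which decays exponentially in the margin $m_i$. Once the margins are large, the sum is dominated by the smallest-margin points, which is consistent with the SV-driven bias evolution and the dependence on SV imbalance described above.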
Abstract
Grokking, the phenomenon of delayed generalization, is often attributed to the depth and compositional structure of deep neural networks. We study grokking in one of the simplest possible settings: the learning of a linear model with logistic loss for binary classification on data that are linearly (and max margin) separable about the origin. We investigate three testing regimes: (1) test data drawn from the same distribution as the training data, in which case grokking is not observed; (2) test data concentrated around the margin, in which case grokking is observed; and (3) adversarial test data generated via projected gradient descent (PGD) attacks, in which case grokking is also observed. We theoretically show that the implicit bias of gradient descent induces a three-phase learning process (population-dominated, support-vector-dominated unlearning, and support-vector-dominated generalization) during which delayed generalization can arise. Our analysis further relates the emergence of grokking to asymmetries in the data, both in the number of examples per class and in the distribution of support vectors across classes, and yields a characterization of the grokking time. We experimentally validate our theory by planting different distributions of population points and support vectors, and by analyzing accuracy curves and hyperplane dynamics. Overall, our results demonstrate that grokking does not require depth or representation learning, and can emerge even in linear models through the dynamics of the bias term.
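The setting described in the abstract can be explored with a small NumPy sketch: a linear classifier with a learnable bias, trained by full-batch gradient descent on logistic loss, with accuracy logged on both the training set and a margin-concentrated test set. Everything specific here is an assumption made for illustration and not the authors' experimental setup: the 2D data generator, the 200-versus-20 class imbalance, the margin band used for the test points, the step size, and the helper names `make_data`, `make_margin_test`, and `train`.

```python
import numpy as np

def make_data(rng, n_pos=200, n_neg=20, margin=1.0):
    """Imbalanced 2D data, linearly separable along the first coordinate."""
    def one_class(n, sign):
        x0 = sign * (margin + rng.exponential(2.0, size=n))  # |x0| >= margin
        x1 = rng.normal(0.0, 1.0, size=n)
        return np.column_stack([x0, x1])
    X = np.vstack([one_class(n_pos, +1.0), one_class(n_neg, -1.0)])
    y = np.concatenate([np.ones(n_pos), -np.ones(n_neg)])
    return X, y

def make_margin_test(rng, n=200, margin=1.0):
    """Balanced test points placed inside the margin band (assumed regime)."""
    y = rng.choice([-1.0, 1.0], size=n)
    x0 = y * rng.uniform(0.1, 1.0, size=n) * margin
    x1 = rng.normal(0.0, 1.0, size=n)
    return np.column_stack([x0, x1]), y

def logistic_loss(w, b, X, y):
    # Mean of log(1 + exp(-y (w.x + b))), computed stably.
    return np.mean(np.logaddexp(0.0, -y * (X @ w + b)))

def accuracy(w, b, X, y):
    return np.mean(np.sign(X @ w + b) == y)

def train(X, y, X_test, y_test, steps=5000, log_every=200):
    n, d = X.shape
    Xa = np.column_stack([X, np.ones(n)])  # last column carries the bias
    # Upper bound on the smoothness constant of the mean logistic loss.
    smooth = np.linalg.eigvalsh(Xa.T @ Xa / n).max() / 4.0
    lr = 1.0 / smooth
    theta = np.zeros(d + 1)
    history = []
    for t in range(steps + 1):
        w, b = theta[:d], theta[d]
        if t % log_every == 0:
            history.append((t,
                            logistic_loss(w, b, X, y),
                            accuracy(w, b, X, y),
                            accuracy(w, b, X_test, y_test),
                            b))
        z = y * (Xa @ theta)
        s = 0.5 * (1.0 - np.tanh(0.5 * z))  # sigmoid(-z), numerically stable
        grad = -(Xa * (y * s)[:, None]).mean(axis=0)
        theta -= lr * grad
    return theta, np.array(history)

rng = np.random.default_rng(0)
X, y = make_data(rng)
X_test, y_test = make_margin_test(rng)
theta, history = train(X, y, X_test, y_test)

# Columns: step, train loss, train accuracy, margin-test accuracy, bias.
for step, loss, tr_acc, te_acc, b in history[::5]:
    print(f"t={int(step):5d}  loss={loss:.4f}  train={tr_acc:.3f}  "
          f"margin-test={te_acc:.3f}  b={b:+.3f}")
```

Plotting the training and margin-test accuracy columns of `history` against the step on a logarithmic axis, alongside the bias column, is one way to look for a gap between fitting the training set and generalizing near the margin. Whether and when such a gap appears depends on the assumed data distribution and imbalance, so this sketch should not be read as reproducing the paper's results.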
