Grokking in Linear Models for Logistic Regression

Nataraj Das, Atreya Vedantam, Chandrashekar Lakshminarayanan

TL;DR

The paper shows that grokking, a delayed generalization phenomenon, can emerge in a simple linear classifier trained with logistic loss when a learnable bias is present. It develops a three-phase learning theory in which the weight vector follows the trajectory $\mathbf w(t)=\hat{\mathbf w}\log t+\rho(t)$ while the bias $b(t)$ evolves slowly, driven by the support vectors (SVs), causing delayed generalization especially under margin-concentrated or adversarial test distributions. The authors provide explicit bounds linking grokking time to SV imbalance and margin sensitivity, and validate the theory with synthetic experiments in 2D and high dimensions, including projected gradient descent (PGD) attacks. The results demonstrate that grokking does not require depth or representation learning and instead arises from implicit bias and late-time optimization dynamics, with practical implications for robustness under distribution shifts.

Abstract

Grokking, the phenomenon of delayed generalization, is often attributed to the depth and compositional structure of deep neural networks. We study grokking in one of the simplest possible settings: the learning of a linear model with logistic loss for binary classification on data that are linearly (and max margin) separable about the origin. We investigate three testing regimes: (1) test data drawn from the same distribution as the training data, in which case grokking is not observed; (2) test data concentrated around the margin, in which case grokking is observed; and (3) adversarial test data generated via projected gradient descent (PGD) attacks, in which case grokking is also observed. We theoretically show that the implicit bias of gradient descent induces a three-phase learning process (population-dominated, support-vector-dominated unlearning, and support-vector-dominated generalization) during which delayed generalization can arise. Our analysis further relates the emergence of grokking to asymmetries in the data, both in the number of examples per class and in the distribution of support vectors across classes, and yields a characterization of the grokking time. We experimentally validate our theory by planting different distributions of population points and support vectors, and by analyzing accuracy curves and hyperplane dynamics. Overall, our results demonstrate that grokking does not require depth or representation learning, and can emerge even in linear models through the dynamics of the bias term.
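
The setting described in the abstract can be sketched in a few lines of NumPy: full-batch gradient descent on the logistic loss with a learnable bias, recording the bias $b(t)$ and the hyperplane offset $|b(t)|/\|\mathbf{w}(t)\|$ at every step. This is only an illustrative sketch, not the authors' experimental code. The helper names (`train_logistic_gd`, `accuracy`, `make_toy_data`), the toy dataset, the class sizes, the learning rate, and the step count are all assumptions chosen for the example.

```python
import numpy as np

def train_logistic_gd(X, y, lr=0.1, steps=2000):
    """Full-batch gradient descent on the mean logistic loss with a learnable bias.

    X: (n, d) features; y: (n,) labels in {-1, +1}.
    Returns the final (w, b) and an array of (b, |b| / ||w||) per step.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    history = []
    for _ in range(steps):
        margins = y * (X @ w + b)
        # d/dm log(1 + exp(-m)) = -sigmoid(-m); logaddexp keeps this stable.
        coeff = -y * np.exp(-np.logaddexp(0.0, margins))
        w -= lr * (X.T @ coeff) / n
        b -= lr * coeff.mean()
        history.append((b, abs(b) / (np.linalg.norm(w) + 1e-12)))
    return w, b, np.array(history)

def accuracy(w, b, X, y):
    """Fraction of points whose predicted sign matches the label."""
    return float(np.mean(np.sign(X @ w + b) == y))

def make_toy_data(n_pos=200, n_neg=50, seed=0):
    """Hypothetical imbalanced 2D data, separable about the origin along x1.

    The closest point of each class has |x1| = 1; all others lie farther out.
    """
    rng = np.random.default_rng(seed)
    x1_pos = np.concatenate([[1.0], 1.0 + rng.exponential(1.0, n_pos - 1)])
    x1_neg = np.concatenate([[-1.0], -1.0 - rng.exponential(1.0, n_neg - 1)])
    x1 = np.concatenate([x1_pos, x1_neg])
    x2 = rng.normal(0.0, 0.5, n_pos + n_neg)
    X = np.column_stack([x1, x2])
    y = np.concatenate([np.ones(n_pos), -np.ones(n_neg)])
    return X, y

X, y = make_toy_data()
w, b, history = train_logistic_gd(X, y)
print("train accuracy:", accuracy(w, b, X, y))
print("final bias and hyperplane offset:", history[-1])
```

Plotting the two columns of `history` against the step index gives curves of the same kind as the bias and distance-from-origin panels described in the figure captions, although this toy run is not tuned to reproduce the paper's three phases.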

Paper Structure

This paper contains 28 sections, 7 theorems, 48 equations, 4 figures, 1 table.

Key Result

Theorem 4.1

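For any smooth, monotonically decreasing loss function with an exponential tail, and for a sufficiently small learning rate, the gradient descent iterates follow $\mathbf{w}(t)=\hat{\mathbf{w}}\log t+\rho(t)$, where $\hat{\mathbf{w}}$ is the max-margin (hard-margin SVM) solution $\hat{\mathbf{w}}=\arg\min_{\mathbf{w}}\|\mathbf{w}\|^2$ subject to $y_i\,\mathbf{w}^\top\mathbf{x}_i\ge 1$ for all $i$, and $\rho(t)$ has a bounded norm.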
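
The trajectory $\mathbf w(t)=\hat{\mathbf w}\log t+\rho(t)$ given in the TL;DR can be checked numerically on a tiny example. The sketch below is an assumption-laden illustration, not part of the paper: it uses the bias-free setting of the cited theorem, a hand-picked two-point dataset with labels absorbed into the inputs ($\mathbf z_i = y_i\mathbf x_i$), and a max-margin direction $\hat{\mathbf w}=(1,0)$ worked out by hand for that dataset (both constraints $\hat{\mathbf w}^\top\mathbf z_i\ge 1$ are active). The function name `gd_logistic_no_bias`, the learning rate, and the checkpoints are hypothetical choices.

```python
import numpy as np

def gd_logistic_no_bias(Z, lr=0.1, checkpoints=(2_000, 20_000)):
    """Gradient descent on sum_i log(1 + exp(-w . z_i)), with z_i = y_i * x_i.

    Returns a dict mapping each checkpoint step to a copy of w at that step.
    """
    w = np.zeros(Z.shape[1])
    snapshots = {}
    for t in range(1, max(checkpoints) + 1):
        margins = Z @ w
        # sigmoid(-margin), written via logaddexp for numerical stability.
        weights = np.exp(-np.logaddexp(0.0, margins))
        # Descent step: the loss gradient is -Z^T weights.
        w += lr * (Z.T @ weights)
        if t in checkpoints:
            snapshots[t] = w.copy()
    return snapshots

# Hand-picked data (an assumption for illustration): z_1 = (1, 2), z_2 = (1, -1).
Z = np.array([[1.0, 2.0], [1.0, -1.0]])
w_hat = np.array([1.0, 0.0])  # max-margin solution for this Z, solved by hand

snaps = gd_logistic_no_bias(Z)
w_early, w_late = snaps[2_000], snaps[20_000]

# If w(t) = w_hat * log t + rho(t) with rho settling down, then over a tenfold
# increase in t the component along w_hat should grow by about log(10), while
# the orthogonal component should barely move.
growth_along_w_hat = (w_late - w_early) @ w_hat
drift_orthogonal = abs(w_late[1] - w_early[1])
print(growth_along_w_hat, np.log(10.0))
print(drift_orthogonal)
```

Comparing the two printed growth values shows how closely the iterates track the $\log t$ rate in this example; the second line shows how much the residual direction still moves between the two checkpoints.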

Figures (4)

  • Figure 1: Grokking with a learnable bias. We analyze grokking in three scenarios: (1) when the test data is drawn from the same distribution as the training data (right); (2) when the test data is concentrated around the optimal separator (sensitive distribution) (left); (3) adversarial test data generated via Projected Gradient Descent (PGD) attacks (center). The model groks on the concentrated and adversarial test data, whereas no grokking is observed on the standard test data.
  • Figure 2: Support vector distributions affect grokking time. (a) Bias (left), (b) distance from origin (middle), and (c) train and concentrated test accuracy (right) versus number of epochs under different numbers of positive and negative support vectors. D(1000,1000) means 1000 positive-class points and 1000 negative-class points; similarly, S(350,350) means 350 support vectors for the positive class and 350 support vectors for the negative class. The three shaded regions mark phase 1, phase 2, and phase 3, respectively, in all the plots.
  • Figure 3: Dataset. Chosen training and test distributions $\mathcal{P}$ and $\mathcal{P}_{conc}$, sampled in $d=2$ dimensions (exaggerated for visibility). Since $\mathcal{P}_{conc}$ is concentrated around the separator, it is sensitive to poorly generalizing solutions.
  • Figure 4: Evolution of (a) the bias $b(t)$ (left), (b) the distance of the separating hyperplane from the origin $|b(t)|/\|\mathbf{w}(t)\|$ (middle), and (c) training and concentrated test accuracies (right), as functions of training epochs, for varying degrees of positive and negative class imbalance. D(1000,1000) means 1000 positive-class points and 1000 negative-class points; similarly, S(350,350) means 350 support vectors for the positive class and 350 support vectors for the negative class. The three shaded regions mark phase 1, phase 2, and phase 3, respectively, in all the plots.
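
The adversarial test regime mentioned in the Figure 1 caption relies on projected gradient descent (PGD) attacks. As a rough illustration of what such an attack looks like against a linear classifier, the sketch below perturbs each input within an $L_2$ ball to increase the logistic loss. The threat model ($L_2$ norm), the radius `eps`, the step size, the iteration count, and the function name `pgd_attack_linear` are all assumptions made for this example; the source does not specify the attack configuration.

```python
import numpy as np

def pgd_attack_linear(X, y, w, b, eps=0.5, step=0.1, iters=20):
    """L2 PGD against the linear classifier sign(w . x + b) (illustrative only).

    X: (n, d) clean inputs; y: (n,) labels in {-1, +1}.
    Returns perturbed inputs with ||x_adv - x||_2 <= eps for every row.
    """
    X_adv = X.copy()
    for _ in range(iters):
        margins = y * (X_adv @ w + b)
        # Gradient of log(1 + exp(-margin)) w.r.t. x is -sigmoid(-margin) * y * w.
        grad = -(np.exp(-np.logaddexp(0.0, margins)) * y)[:, None] * w[None, :]
        # Normalized ascent step on the loss.
        g_norm = np.linalg.norm(grad, axis=1, keepdims=True)
        X_adv = X_adv + step * grad / np.maximum(g_norm, 1e-12)
        # Project back onto the L2 ball of radius eps around the clean inputs.
        delta = X_adv - X
        d_norm = np.linalg.norm(delta, axis=1, keepdims=True)
        X_adv = X + delta * np.minimum(1.0, eps / np.maximum(d_norm, 1e-12))
    return X_adv

# Hypothetical classifier and test points, chosen only to exercise the function.
w_demo, b_demo = np.array([1.0, 0.0]), 0.0
X_demo = np.array([[2.0, 0.0], [0.3, 0.0], [-0.3, 1.0], [-2.0, 0.0]])
y_demo = np.array([1.0, 1.0, -1.0, -1.0])

X_adv = pgd_attack_linear(X_demo, y_demo, w_demo, b_demo)
clean_margins = y_demo * (X_demo @ w_demo + b_demo)
adv_margins = y_demo * (X_adv @ w_demo + b_demo)
print("clean margins:", clean_margins)
print("adversarial margins:", adv_margins)
```

For a linear model the loss gradient with respect to the input always points along $-y\,\mathbf w$, so the attack simply pushes each point toward the decision boundary. Points that start within `eps` of the separator end up misclassified, which is why a test set of this kind is sensitive to small shifts of the hyperplane.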

Theorems & Definitions (13)

  • Definition 3.4: Grokking
  • Theorem 4.1: Rephrased from Theorem 3 of Soudry et al.
  • Theorem 4.2: Rephrased from Theorem 4 of Soudry et al.
  • Theorem 4.3
  • Proof
  • Theorem 4.5
  • Proof sketch
  • Theorem 4.6
  • Remark 4.7
  • Theorem 1.1
  • ...and 3 more