
Grokking as a Phase Transition between Competing Basins: a Singular Learning Theory Approach

Ben Cullen, Sergio Estan-Ruiz, Riya Danait, Jiayi Li

TL;DR

Grokking is studied through the lens of Singular Learning Theory (SLT), a Bayesian framework that characterizes the geometry of the loss landscape via the local learning coefficient (LLC), a measure of the local degeneracy of the loss surface. The paper finds evidence that lower-LLC basins are linked to higher posterior mass concentration and lower expected generalisation error.

Abstract

Grokking, the abrupt transition from memorisation to generalisation after extended training, suggests the presence of competing solution basins with distinct statistical properties. We study this phenomenon through the lens of Singular Learning Theory (SLT), a Bayesian framework that characterizes the geometry of the loss landscape via the local learning coefficient (LLC), a measure of the local degeneracy of the loss surface. SLT links lower-LLC basins to higher posterior mass concentration and lower expected generalisation error. Leveraging this theory, we interpret grokking in quadratic networks as a phase transition between competing near-zero-loss solution basins. Our contributions are two-fold: we derive closed-form expressions for the LLC in quadratic networks trained on modular arithmetic tasks, with the corresponding empirical verification; and we provide empirical evidence that LLC trajectories are a reliable tool for tracking generalisation dynamics and interpreting phase transitions during training.
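
As a hedged reminder of why a lower LLC goes with higher posterior mass, the standard leading-order SLT asymptotics can be written as follows. The notation is assumed here for illustration and is not taken from the paper: $n$ is the number of samples, $L_n$ the empirical loss, $w^{*}$ a minimum inside a basin, and $\lambda(w^{*})$ its local learning coefficient.

$$F_n(w^{*}) \approx n\,L_n(w^{*}) + \lambda(w^{*}) \log n$$

Under this approximation, two basins with comparable near-zero loss differ mainly in the $\lambda \log n$ term, so the basin with the smaller $\lambda$ has the lower local free energy $F_n$ and hence the larger share of posterior mass as $n$ grows. Lower-order terms are omitted.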

Paper Structure (67 sections, 20 theorems, 157 equations, 14 figures)


Key Result

Theorem 4.1

Let $f_{\theta}$ be a quadratic network with no bias terms and hidden width $K \geq \frac{d(d+1)}{2}$. Let $\theta^{*}=(W^{*},V^{*})$ be a true solution and assume that there exists an index set $I\subset \{1,\dots,K\}$ with $|I|=\frac{d(d+1)}{2}$ such that $\{w_{i}^{*} {w_{i}^{*}}^{T}\}_{i\in I}$ s
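
The width threshold $\frac{d(d+1)}{2}$ in the statement above is the dimension of the space of symmetric $d \times d$ matrices. The sketch below is a minimal numerical illustration of that count, not a restatement of the truncated condition. It assumes a bias-free network of the form $V\,(Wx)^{2}$ with an elementwise square, random Gaussian weights, and an arbitrary choice of three outputs; the function names are hypothetical. It checks that each output is a quadratic form in $x$ and that $\frac{d(d+1)}{2}$ randomly drawn matrices $w_i w_i^{T}$ are linearly independent.

```python
import numpy as np

def quadratic_net(x, W, V):
    """Bias-free network with squared activation: V @ (W @ x)**2 (assumed form)."""
    return V @ (W @ x) ** 2

def rank_of_outer_products(W):
    """Rank of the family {w_i w_i^T}, each matrix flattened to length d*d."""
    flat = np.stack([np.outer(w, w).ravel() for w in W])
    return np.linalg.matrix_rank(flat)

rng = np.random.default_rng(0)
d = 4
K = d * (d + 1) // 2               # dimension of symmetric d x d matrices
W = rng.standard_normal((K, d))    # hidden weights, one row per w_i
V = rng.standard_normal((3, K))    # output weights (3 outputs, illustrative)
x = rng.standard_normal(d)

# Output j equals x^T M_j x with M_j = sum_i V[j, i] * w_i w_i^T.
M = np.einsum("ji,ia,ib->jab", V, W, W)
quad_forms = np.einsum("a,jab,b->j", x, M, x)

print(np.allclose(quadratic_net(x, W, V), quad_forms))
print(rank_of_outer_products(W), K)
```

Because the network output depends on the weights only through the symmetric matrices $M_j$, many different $(W, V)$ pairs realise the same function, which is one way to see why such models are degenerate in the sense measured by the LLC.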

Figures (14)

  • Figure 1: Linear relationship between $p$ and the final LLC of the trained model. Experiment repeated for several dimensions of the model's hidden layer.
  • Figure 2: Linear relationship between the dimension of the model's hidden layer and the final LLC of the trained model. Experiment repeated for several values of $p$.
  • Figure 3: Training and validation loss, together with the LLC curve, during training of a model with $p=53$, learning rate 0.0001, weight decay 0.0001, batch size 128 and hidden dimension 1024.
  • Figure 4: The same model trained with different learning rates. The learning rate is plotted against its GSM.
  • Figure 5: Three different toy landscapes and their respective LLC values at minima of their surfaces. Plots taken from hoogland2025loss.
  • ...and 9 more figures

Theorems & Definitions (49)

  • Theorem 4.1
  • proof
  • Theorem 4.2
  • proof
  • Remark 4.3
  • Theorem 5.1: LLC of NTK models
  • proof
  • Theorem 5.2: LLC in the lazy (random-feature) memorisation regime
  • proof
  • Corollary 5.3: Modular arithmetic with quadratic activation
  • ...and 39 more