Grokking as a Phase Transition between Competing Basins: a Singular Learning Theory Approach

Ben Cullen; Sergio Estan-Ruiz; Riya Danait; Jiayi Li

TL;DR

Grokking is studied through the lens of Singular Learning Theory (SLT), a Bayesian framework that characterizes the geometry of the loss landscape via the local learning coefficient (LLC), a measure of the local degeneracy of the loss surface. The study finds evidence that lower-LLC basins are linked to higher posterior mass concentration and lower expected generalisation error.

Abstract

Grokking, the abrupt transition from memorisation to generalisation after extended training, suggests the presence of competing solution basins with distinct statistical properties. We study this phenomenon through the lens of Singular Learning Theory (SLT), a Bayesian framework that characterizes the geometry of the loss landscape via the local learning coefficient (LLC), a measure of the local degeneracy of the loss surface. SLT links lower-LLC basins to higher posterior mass concentration and lower expected generalisation error. Leveraging this theory, we interpret grokking in quadratic networks as a phase transition between competing near-zero-loss solution basins. Our contributions are two-fold. First, we derive closed-form expressions for the LLC in quadratic networks trained on modular arithmetic tasks and verify them empirically. Second, we provide empirical evidence that LLC trajectories are a reliable tool for tracking generalisation dynamics and interpreting phase transitions during training.
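
The link between a lower LLC and higher posterior mass can be made concrete with the standard SLT free-energy expansion. The following is a sketch in notation that does not appear in the text above and is assumed here for illustration: $n$ is the sample size, $L_n$ the empirical loss, $\varphi$ the prior, $W_\alpha$ a neighbourhood of a local minimum $w^*_\alpha$, and $\lambda_\alpha$ the LLC at that minimum.

$$
F_n(W_\alpha) = -\log \int_{W_\alpha} e^{-n L_n(w)}\,\varphi(w)\,dw \;\approx\; n L_n(w^*_\alpha) + \lambda_\alpha \log n ,
\qquad
\frac{\Pr(W_1)}{\Pr(W_2)} = e^{F_n(W_2) - F_n(W_1)} \;\approx\; n^{\lambda_2 - \lambda_1} \quad \text{when } L_n(w^*_1) = L_n(w^*_2).
$$

Under these assumptions, and ignoring lower-order terms, two basins with equal (for example near-zero) loss differ in posterior mass by a power of $n$ set by the gap in their LLCs, so the basin with the lower LLC is favoured as $n$ grows.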

Paper Structure (67 sections, 20 theorems, 157 equations, 14 figures)

Introduction
Contributions.
Background
The Local Learning Coefficient
Regular versus singular models.
Local parameter degeneracy and the LLC.
Bayesian phase transitions and the LLC.
LLC estimation.
Related work
Empirical discovery and mechanistic accounts of grokking.
Bayesian phase transition.
Hidden progress and complexity-driven transitions.
Flatness of loss landscape.
Singular learning theory and LLC.
Problem Set-up
...and 52 more sections

Key Result

Theorem 4.1

Let $f_{\theta}$ be a quadratic network with no bias terms and hidden width $K \geq \frac{d(d+1)}{2}$. Let $\theta^{*}=(W^{*},V^{*})$ be a true solution and assume that there exists an index set $I\subset \{1,\dots,K\}$ with $|I|=\frac{d(d+1)}{2}$ such that $\{w_{i}^{*} {w_{i}^{*}}^{T}\}_{i\in I}$ s
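
The width threshold $\frac{d(d+1)}{2}$ in the statement above equals the dimension of the space of symmetric $d \times d$ matrices. The sketch below illustrates why that count is natural, under an assumption the statement does not confirm: that the quadratic network has the form $f_\theta(x) = V (Wx)^2$ with elementwise squaring. The dimensions `d` and `m`, the random weights, and the helper names are hypothetical and chosen only for illustration.

```python
import numpy as np

def quadratic_network(W, V, x):
    """Forward pass of a bias-free network with elementwise-square activation.

    Assumed form (not confirmed by the source): f(x) = V @ (W @ x) ** 2.
    """
    return V @ (W @ x) ** 2

def outer_product_rank(W):
    """Dimension of the span of the matrices w_i w_i^T, one per hidden unit."""
    # Flatten each d x d outer product into a row of length d * d.
    flat = np.stack([np.outer(w, w).ravel() for w in W])
    return np.linalg.matrix_rank(flat)

rng = np.random.default_rng(0)
d, m = 4, 3                      # input and output dimensions (hypothetical)
K = d * (d + 1) // 2             # smallest hidden width allowed by the statement
W = rng.standard_normal((K, d))  # rows are the hidden weight vectors w_k
V = rng.standard_normal((m, K))
x = rng.standard_normal(d)

# Each output coordinate j is the quadratic form x^T A_j x, where
# A_j = sum_k V[j, k] * w_k w_k^T is a symmetric d x d matrix.
A = np.einsum("jk,ka,kb->jab", V, W, W)
quad_forms = np.einsum("a,jab,b->j", x, A, x)

out = quadratic_network(W, V, x)
rank = outer_product_rank(W)
```

In this sketch the network output depends on the weights only through the symmetric matrices `A[j]`, and for generic weights the outer products of `K = d(d+1)/2` hidden vectors are linearly independent, so they can represent any symmetric matrix. Adding more hidden units cannot raise the rank beyond $\frac{d(d+1)}{2}$.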

Figures (14)

Figure 1: Linear relationship between $p$ and the final LLC of the trained model. Experiment repeated for several dimensions of the model's hidden layer.
Figure 2: Linear relationship between the dimension of the model's hidden layer and the final LLC of the trained model. Experiment repeated for several values of $p$.
Figure 3: Training loss, validation loss, and LLC during training for a model with $p=53$, learning rate 0.0001, weight decay 0.0001, batch size 128, and hidden dimension 1024.
Figure 4: The same model trained with different learning rates. Each learning rate is plotted against its GSM.
Figure 5: Three different toy landscapes and their respective LLC values at minima of their surfaces. Plots taken from hoogland2025loss.
...and 9 more figures

Theorems & Definitions (49)

Theorem 4.1
proof
Theorem 4.2
proof
Remark 4.3
Theorem 5.1: LLC of NTK models
proof
Theorem 5.2: LLC in the lazy (random-feature) memorisation regime
proof
Corollary 5.3: Modular arithmetic with quadratic activation
...and 39 more
