The Geometry of Grokking: Norm Minimization on the Zero-Loss Manifold

Tiberiu Musat

TL;DR

This work analyzes grokking, the phenomenon where neural networks generalize only after extensive memorization, by viewing post-memorization learning as constrained optimization that minimizes the weight norm on the zero-loss manifold $\mathcal{Z}$. In the regime of small learning rates and small weight decay, gradient flow stays close to $\mathcal{Z}$ while weight decay reduces the norm along the available directions, a property formalized as gradient orthogonality to the tangent space of $\mathcal{Z}$. The paper introduces an approximation that isolates the dynamics of a subset of parameters and derives closed-form first-layer dynamics for two-layer networks, enabling a tractable analysis of embedding-like components. Empirical validation on modular addition demonstrates both delayed generalization and the emergence of circular representations in the embedding layer, including a Fourier-analytic depiction of the learned structure. Overall, the results provide a principled mechanism for grokking and a framework for studying representation learning within subcomponents of neural networks.
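To make the "circular representation" and its Fourier-analytic reading concrete, the following is a minimal sketch rather than the paper's code. It assumes a modulus `p`, a single frequency `k`, and a hand-built two-dimensional embedding that places token `a` at angle $2\pi k a / p$ on a circle; the helper names are hypothetical. Taking a discrete Fourier transform along the token axis then concentrates all the energy at frequency `k` (and its mirror `p - k`).

```python
import numpy as np

def circular_embedding(p, k):
    """Hypothetical embedding: token a -> (cos, sin) of the angle 2*pi*k*a/p."""
    a = np.arange(p)
    angle = 2.0 * np.pi * k * a / p
    return np.stack([np.cos(angle), np.sin(angle)], axis=1)  # shape (p, 2)

def fourier_feature_norms(embedding):
    """Norm of each Fourier component, taken along the token axis."""
    spectrum = np.fft.fft(embedding, axis=0)  # shape (p, embedding_dim)
    return np.linalg.norm(spectrum, axis=1)   # one norm per frequency

p, k = 11, 3  # illustrative values, not taken from the paper
norms = fourier_feature_norms(circular_embedding(p, k))
print(np.round(norms, 3))
```

For an embedding that contains several circles, each circle would show up as its own pair of peaks in this spectrum.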

Abstract

Grokking is a puzzling phenomenon in neural networks where full generalization occurs only after a substantial delay following the complete memorization of the training data. Previous research has linked this delayed generalization to representation learning driven by weight decay, but the precise underlying dynamics remain elusive. In this paper, we argue that post-memorization learning can be understood through the lens of constrained optimization: gradient descent effectively minimizes the weight norm on the zero-loss manifold. We formally prove this in the limit of infinitesimally small learning rates and weight decay coefficients. To further dissect this regime, we introduce an approximation that decouples the learning dynamics of a subset of parameters from the rest of the network. Applying this framework, we derive a closed-form expression for the post-memorization dynamics of the first layer in a two-layer network. Experiments confirm that simulating the training process using our predicted gradients reproduces both the delayed generalization and representation learning characteristic of grokking.
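One way to write down the constrained-optimization view described above is sketched here. The notation is assumed for illustration and may differ from the paper's: $\mathcal{L}$ is the training loss, $\lambda$ the weight decay coefficient (with the $\tfrac{\lambda}{2}$ scaling chosen by convention), and $P_{T_\theta\mathcal{Z}}$ the orthogonal projection onto the tangent space of $\mathcal{Z}$ at a non-singular point $\theta$.

```latex
\mathcal{L}_\lambda(\theta) = \mathcal{L}(\theta) + \frac{\lambda}{2}\,\lVert\theta\rVert^2,
\qquad
\mathcal{Z} = \{\theta : \mathcal{L}(\theta) = 0\},
\qquad
\min_{\theta \in \mathcal{Z}} \; \frac{1}{2}\,\lVert\theta\rVert^2 .

% Heuristic slow dynamics near Z for small lambda:
% the loss gradient cancels the normal component of the weight-decay term,
% so only the tangential part of -lambda * theta moves the parameters.
\dot{\theta} \;\approx\; -\lambda \, P_{T_\theta\mathcal{Z}}\,\theta .
```

Read this way, memorization is the fast phase that reaches $\mathcal{Z}$, and the delayed generalization is the slow drift along $\mathcal{Z}$ toward lower norm, at a speed proportional to $\lambda$.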

Paper Structure

This paper contains 33 sections, 9 theorems, 59 equations, 4 figures.

Key Result

Theorem 4.9

For every trajectory starting at a zero-loss solution $\theta(0) \in \mathcal{Z}$ and every $\varepsilon > 0$, there exists $\lambda_\varepsilon > 0$ such that for all $0 < \lambda < \lambda_\varepsilon$ the trajectory under $\mathcal{L}_\lambda$ satisfies

Figures (4)

  • Figure 1: A two-parameter linear model $\hat{y} = w_1x_1 + w_2x_2$ groks simple addition when trained with just one sample: $x_1 = x_2 = 1,\ y=2$ (corresponding to $1 + 1 = 2$). We plot three training runs with different weight decay coefficients $\lambda$. After quickly achieving (almost) zero loss, learning is entirely driven by the minimization of the weight norm.
  • Figure 2: A three-parameter linear model $\hat{y} = w_1 x_1 + w_2 x_2 + w_3 x_3$ groks three-number addition when trained with just one sample: $x_1=x_2=x_3=1,\ y=3$ (corresponding to $1 + 1 + 1= 3$). The gray area shows the zero-loss plane, shaded according to the weight norm, where a lighter shade denotes a lower norm.
  • Figure 3: Training trajectories with different data, architectures and initializations. Left: a two-layer linear network where the zero-loss set is curved. Center: a two-layer linear network where the zero-loss set has a singularity at $(w_1, w_2) = (0, 0)$. Right: a single-layer network with leaky ReLU activation groks simple addition.
  • Figure 4: Simulated dynamics according to Eq. \ref{eq:two-layer-gradient-zero} reproduce the phenomena of delayed generalization and representation learning. Top left: generalization emerges after about $1000$ steps, despite the training loss being exactly zero throughout. Top right: the norms of the Fourier features equalize, suggesting the presence of equally sized circles. Bottom left: the Fourier features become orthogonal, suggesting that the circles lie in orthogonal planes. Bottom right: the absolute values of the Fourier features become dissimilar, suggesting that each circle leverages a different subset of hidden activations.
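The setting of Figure 1 is small enough to simulate directly. The sketch below is not the paper's code: the squared-error loss, the $\tfrac{\lambda}{2}\lVert w\rVert^2$ penalty, the learning rate, the initialization $(3, 0)$, and the held-out sample $2 + 3 = 5$ are all assumptions chosen for illustration. It trains $\hat{y} = w_1 x_1 + w_2 x_2$ on the single sample $x_1 = x_2 = 1,\ y = 2$ with plain gradient descent.

```python
def simulate(weight_decay=0.01, lr=0.1, steps=10_000, w_init=(3.0, 0.0)):
    """Gradient descent on (w1 + w2 - 2)^2 + (weight_decay / 2) * (w1^2 + w2^2).

    Returns a list of (step, train_loss, test_loss, w1, w2) tuples.
    All hyperparameters are illustrative assumptions.
    """
    w1, w2 = w_init
    history = []
    for step in range(steps + 1):
        # Training sample: x = (1, 1), y = 2.
        residual = w1 + w2 - 2.0
        train_loss = residual ** 2
        # Assumed held-out sample: x = (2, 3), y = 5.
        test_loss = (2.0 * w1 + 3.0 * w2 - 5.0) ** 2
        history.append((step, train_loss, test_loss, w1, w2))
        # Loss gradient is 2 * residual for both weights; weight decay adds lambda * w.
        g1 = 2.0 * residual + weight_decay * w1
        g2 = 2.0 * residual + weight_decay * w2
        w1 -= lr * g1
        w2 -= lr * g2
    return history

history = simulate()
for step in (0, 50, 1_000, 10_000):
    _, train_loss, test_loss, w1, w2 = history[step]
    print(f"step {step:>6}: train={train_loss:.2e} test={test_loss:.2e} w=({w1:.3f}, {w2:.3f})")
```

Under these assumptions the sum $w_1 + w_2$ contracts toward $2$ within a few dozen steps, while the difference $w_1 - w_2$ shrinks only by a factor of $1 - \text{lr} \cdot \lambda$ per step. The training loss is therefore near zero long before the weights approach the minimum-norm solution $w_1 = w_2 = 1$ that generalizes to other sums.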

Theorems & Definitions (23)

  • Definition 4.1: Zero-Loss Set
  • Definition 4.2: Singular Points
  • Definition 4.8: Distance
  • Theorem 4.9: Stability of $\mathcal{Z}$
  • Definition 4.10: Available Direction
  • Definition 4.11: Tangent Space
  • Definition 4.12: Projection
  • Theorem 4.13: Gradient Orthogonality
  • Remark 4.14
  • Theorem 5.1
  • ...and 13 more