How to explain grokking

S. V. Kozyrev

TL;DR

The paper addresses delayed generalization (grokking) in overparameterized models by modeling learning as stochastic gradient Langevin dynamics (SGLD), i.e., diffusion in a potential. It derives the corresponding diffusion (Fokker-Planck) equations, shows that the Gibbs distribution is the stationary state, and links optimization to free-energy minimization through an Eyring-type barrier framework. The author proposes that grokking arises when Brownian motion on a zero-risk manifold passes from low-entropy overfitting regions to high-entropy generalizing regions as the training data grow, with the threshold behavior and time scaling explained by entropic barriers. This thermodynamic perspective offers a unified account of delayed generalization and makes testable predictions about how data size and optimization dynamics affect grokking.
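The claim that Langevin dynamics settles into a Gibbs distribution can be checked on a toy problem. The sketch below is an illustration only, not the paper's model: it assumes a one-dimensional quadratic potential U(x) = x^2/2, uses the exact gradient with injected Gaussian noise (no minibatching), and all names (`sgld_samples`, `TEMPERATURE`, the step size and chain counts) are hypothetical choices. For this potential the Gibbs density exp(-U(x)/T) is a Gaussian with variance T, so the empirical variance of long-run samples should be close to T, up to discretization and sampling error.

```python
import math
import random


def sgld_samples(grad_u, temperature, step=0.05, n_steps=400,
                 n_chains=2000, x0=3.0, seed=0):
    """Run independent Euler-Maruyama Langevin chains; return final states.

    Update rule (assumed standard overdamped form):
        x <- x - step * grad_u(x) + sqrt(2 * step * temperature) * N(0, 1)
    """
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * step * temperature)
    finals = []
    for _ in range(n_chains):
        x = x0
        for _ in range(n_steps):
            x = x - step * grad_u(x) + noise_scale * rng.gauss(0.0, 1.0)
        finals.append(x)
    return finals


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


# Toy potential U(x) = x^2 / 2, so grad U(x) = x (an assumption for this demo).
TEMPERATURE = 0.5
samples = sgld_samples(lambda x: x, TEMPERATURE)
empirical_mean = mean(samples)
empirical_var = variance(samples)

print(f"mean = {empirical_mean:.3f}, variance = {empirical_var:.3f}, "
      f"Gibbs prediction for variance = {TEMPERATURE}")
```

Raising `TEMPERATURE` widens the stationary distribution, which is the sense in which noise lets the dynamics explore regions that plain gradient descent would never leave.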

Abstract

Grokking (delayed generalization) in learning is explained by modeling it with stochastic gradient Langevin dynamics (Brownian motion) and applying ideas from thermodynamics.

Paper Structure

This paper contains 6 sections, 16 equations.