Omnigrok: Grokking Beyond Algorithmic Data
Ziming Liu, Eric J. Michaud, Max Tegmark
TL;DR
This work analyzes grokking, the puzzling delay in generalization after overfitting, through neural loss landscapes. It introduces the LU mechanism: plotted against weight norm, training loss follows an L-shape while test loss follows a U-shape. By reducing the loss landscape to a one-dimensional function of weight norm, with the weight direction optimized at each norm, the authors show how initialization scale and weight decay govern the time to generalize, including delays that scale as t ∝ 1/γ, where γ is the weight decay strength. They demonstrate grokking across diverse tasks (algorithmic data, MNIST, IMDb, QM9) and find that the strength of grokking correlates with how heavily a task relies on learning good representations. This explains why grokking is dramatic for algorithmic datasets but muted for MNIST. They also show that constraining the weight norm can almost eliminate grokking. Overall, the loss-landscape perspective provides a coherent, predictive framework for grokking across domains and highlights representation learning as a central factor.
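The t ∝ 1/γ scaling can be illustrated with a minimal sketch. It assumes that, once the training data is fit, the training loss is roughly flat along the weight-norm direction, so weight decay alone shrinks the norm as dw/dt = -γw. Under that assumption the norm decays exponentially, and the time to fall from an initial norm w0 to a target norm wc is ln(w0/wc)/γ. The function names, the step size, and the values of w0 and wc below are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
import math


def simulated_time(w0, wc, gamma, lr=1e-3):
    """Time for pure weight decay to shrink the norm from w0 to wc.

    Assumption: the loss gradient along the norm direction is negligible,
    so each step only applies the decay update w <- w - lr * gamma * w.
    Elapsed time is measured as (number of steps) * lr.
    """
    w = w0
    steps = 0
    while w > wc:
        w -= lr * gamma * w
        steps += 1
    return steps * lr


def predicted_time(w0, wc, gamma):
    """Closed-form time from solving dw/dt = -gamma * w."""
    return math.log(w0 / wc) / gamma


if __name__ == "__main__":
    w0, wc = 8.0, 2.0  # hypothetical initial and target weight norms
    for gamma in (0.1, 0.01):
        print(gamma, simulated_time(w0, wc, gamma), predicted_time(w0, wc, gamma))
```

In this toy setting, dividing γ by ten lengthens the time to reach the target norm by about a factor of ten, which is the inverse dependence on weight decay described above.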
Abstract
Grokking, the unusual phenomenon for algorithmic datasets where generalization happens long after overfitting the training data, has remained elusive. We aim to understand grokking by analyzing the loss landscapes of neural networks, identifying the mismatch between training and test losses as the cause for grokking. We refer to this as the "LU mechanism" because training and test losses (against model weight norm) typically resemble "L" and "U", respectively. This simple mechanism can nicely explain many aspects of grokking: data size dependence, weight decay dependence, the emergence of representations, etc. Guided by the intuitive picture, we are able to induce grokking on tasks involving images, language and molecules. In the reverse direction, we are able to eliminate grokking for algorithmic datasets. We attribute the dramatic nature of grokking for algorithmic datasets to representation learning.
