Grokking vs. Learning: Same Features, Different Encodings

Dmitry Manning-Coe, Jacopo Gliozzi, Alexander G. Stapleton, Edward Hirst, Giuseppe De Tomasi, Barry Bradlyn, David S. Berman

TL;DR

This work probes whether grokking and ordinary steady training yield fundamentally different models by comparing their learned features, compressibility, and training dynamics on two interpretable tasks (Ising-model phase classification and modular addition). It shows that both paths converge on the same features (energy and magnetization for Ising; Fourier coefficients for modular addition), but encodings differ, with a novel compressive regime appearing only in steady learning, enabling substantial model compression. The authors introduce information-geometric measures based on the Fisher Information Metric to analyze training trajectories, uncovering task-dependent pre-grokking dynamics and, in at least one task, a near-straight-line trajectory in information space for grokking. These findings illuminate when grokking offers practical benefits and suggest a framework for interpreting and compressing models via geometry-driven diagnostics.

Abstract

Grokking typically achieves similar loss to ordinary, "steady", learning. We ask whether these different learning paths (grokking versus ordinary training) lead to fundamental differences in the learned models. To do so, we compare the features, compressibility, and learning dynamics of models trained via each path in two tasks. We find that grokked and steadily trained models learn the same features, but there can be large differences in the efficiency with which these features are encoded. In particular, we find a novel "compressive regime" of steady training, absent in grokking, in which a linear trade-off emerges between model loss and compressibility. In this regime, we can achieve compression factors of 25x relative to the base model, and 5x the compression achieved in grokking. We then track how model features and compressibility develop through training. We show that model development in grokking is task-dependent, and that peak compressibility is achieved immediately after the grokking plateau. Finally, we introduce novel information-geometric measures which demonstrate that models undergoing grokking follow a straight path in information space.

Paper Structure

This paper contains 26 sections, 15 equations, 34 figures, 1 table.

Figures (34)

  • Figure 1: Snapshots of the 16 x 16 Ising lattice, where white and black squares indicate $\pm 1$ spin assignments. As temperature increases, snapshots become more disordered, passing through a phase transition at temperature $T_c$.
  • Figure 2: Ising and modular addition training curves, for varying weight multipliers at initialization (larger values are redder). The dashed lines indicate training accuracies/losses, and the solid lines test accuracies/losses.
  • Figure 3: In the Ising task, the model learns the energy in both the grokking (a,c) and learning (b,d) regimes. In the top row, we show a scatter plot of the energy of input snapshots versus the resulting neuron pre-activations in the final layer. Data is shown for only the most activated neuron. In the second row, we show that this high degree of correlation is typical of all neurons in the final layer.
  • Figure 4: In the modular addition task, the model learns the Fourier modes of the input data. To see this, we Fourier transform the unembedding matrix and plot its coefficients, which correspond to different Fourier frequencies. The heatmaps for (a) grokking and (b) learning suggest that each neuron predominantly activates on the sine and cosine of a single frequency. This is confirmed by calculating the inverse participation ratio (IPR) for all neurons, shown in (c) for grokking and (d) for learning.
  • Figure 5: Pruning curves for (a) Ising and (b) modular addition. In the modular addition task, there is a linear trade-off between the end-loss and the compressibility $c$, shown in (c), while no such relationship exists for the Ising task, shown in (d). Data in the top row is averaged over seeds, while the bottom row shows all seeds.
  • ...and 29 more figures
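
The single-frequency check described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the common definition of the inverse participation ratio, $\mathrm{IPR} = \sum_k p_k^2$ with $p_k = |c_k|^2 / \sum_j |c_j|^2$, applied to the Fourier power spectrum of each neuron's weight vector. The function name `fourier_ipr`, the axis convention (one row per neuron), and the modulus `p = 113` are hypothetical choices made for the example; the paper may use a different normalization or layout.

```python
import numpy as np


def fourier_ipr(weights: np.ndarray) -> np.ndarray:
    """Inverse participation ratio of each row's Fourier power spectrum.

    weights: array of shape (n_neurons, p), one nonzero row per neuron
    (assumed layout). Returns shape (n_neurons,), with values near 1 for
    a single dominant frequency and near 1/n_freqs for a flat spectrum.
    """
    # The real FFT merges the sine and cosine of one frequency into a
    # single complex coefficient, so one frequency occupies one bin.
    coeffs = np.fft.rfft(weights, axis=1)
    power = np.abs(coeffs) ** 2
    # Normalize the power spectrum into a distribution over frequencies.
    prob = power / power.sum(axis=1, keepdims=True)
    return (prob ** 2).sum(axis=1)


# Synthetic "neurons" (illustrative data, not taken from the paper).
p = 113  # hypothetical modulus
n = np.arange(p)
single = np.cos(2 * np.pi * 5 * n / p) + 0.5 * np.sin(2 * np.pi * 5 * n / p)
mixed = np.cos(2 * np.pi * 3 * n / p) + np.cos(2 * np.pi * 7 * n / p)

ipr_values = fourier_ipr(np.stack([single, mixed]))
print(ipr_values)
```

Under this definition, the first row, which mixes the sine and cosine of one frequency, has an IPR close to 1, while the second row, which splits its power equally between two frequencies, has an IPR close to 0.5. A histogram of such values over all neurons is one way to quantify how strongly each neuron is localized on a single frequency.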

Theorems & Definitions (2)

  • Definition 1: Grokking time
  • Definition 2: Grokking