Grokking as a First Order Phase Transition in Two Layer Networks

Noa Rubin, Inbar Seroussi, Zohar Ringel

TL;DR

The paper addresses the mechanism behind Grokking and feature learning in two-layer neural networks. It adopts a mean-field/adaptive kernel framework and maps Grokking to a first-order phase transition, after which the network sits in a mixed, Gaussian-mixture feature-learning (GMFL) phase. It analyzes two teacher-student toy models, one with a cubic-polynomial teacher and one with a modular addition teacher, to derive analytic predictions, including phase diagrams with GMFL-I and GMFL-II regimes and a simplified scalar order parameter. It also reports numerical simulations showing that feature learning can reduce sample complexity and cause delayed generalization, linking Grokking to latent kernel adaptation. The results suggest a general, analytically tractable picture of representation learning in DNNs and potential practical implications for training and pruning.

Abstract

A key property of deep neural networks (DNNs) is their ability to learn new features during training. This intriguing aspect of deep learning stands out most clearly in recently reported Grokking phenomena. While mainly reflected as a sudden increase in test accuracy, Grokking is also believed to be a beyond lazy-learning/Gaussian Process (GP) phenomenon involving feature learning. Here we apply a recent development in the theory of feature learning, the adaptive kernel approach, to two teacher-student models with cubic-polynomial and modular addition teachers. We provide analytical predictions on feature learning and Grokking properties of these models and demonstrate a mapping between Grokking and the theory of phase transitions. We show that after Grokking, the state of the DNN is analogous to the mixed phase following a first-order phase transition. In this mixed phase, the DNN generates useful internal representations of the teacher that are sharply distinct from those before the transition.
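
To make the two teacher-student setups concrete, the following is a minimal sketch of how such targets could be generated. The functional form of the cubic teacher (a linear term plus `eps` times a cubic Hermite term), the unit-norm teacher direction `w_star`, and all function and parameter names are illustrative assumptions, not the paper's definitions; the figure captions only indicate that the teacher has a linear component and a cubic component $H_3$.

```python
import random


def hermite3(z: float) -> float:
    """Probabilists' Hermite polynomial He_3(z) = z^3 - 3z."""
    return z ** 3 - 3.0 * z


def cubic_teacher(x, w_star, eps):
    """Assumed teacher: linear term plus eps times a cubic Hermite term."""
    z = sum(wi * xi for wi, xi in zip(w_star, x))
    return z + eps * hermite3(z)


def sample_cubic_dataset(n, d, eps, seed=0):
    """Draw n Gaussian inputs in d dimensions and label them with the teacher."""
    rng = random.Random(seed)
    # Hypothetical choice: the teacher direction is the first basis vector.
    w_star = [1.0] + [0.0] * (d - 1)
    xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    ys = [cubic_teacher(x, w_star, eps) for x in xs]
    return xs, ys


def modular_addition_table(p):
    """Map every pair (a, b) with 0 <= a, b < p to (a + b) mod p."""
    return {(a, b): (a + b) % p for a in range(p) for b in range(p)}
```

A student network would then be trained on a subset of these pairs, with the remaining pairs held out to measure generalization.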

Paper Structure (26 sections, 63 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Schematic phase diagram of learning phases. The inset plots in the different phases correspond to the normalized negative log posterior ($\hat{\mathcal{S}}$). A transition between the different phases can be achieved by varying either $n$ or $N$.
  • Figure 2: GFL to GMFL-I theory and experiment in the teacher-student model. Panels (a) and (b) show the negative log posterior before and after the phase transition induced by varying the noise $\sigma^2$. The match between the experimental values (crosses) and theory (solid lines) extends to rarer and rarer events as we scale up the model according to Table 1 (colour coding). Turning to network outputs, panel (c) shows the expected phase transition in learning the cubic teacher component, and the inset shows the discrepancy in the linear teacher component. As our analytics hold before and at the phase transitions, discrepancies in $\bar{f} \cdot H_3$ close to the transition are an expected finite-size effect made pronounced by its low absolute value.
  • Figure 3: GFL to GMFL-I to GMFL-II. (a) Probability distribution of weights as predicted by our approach. The GFL phase is represented by the red graphs, where the minimum of the action at zero (shared also by the GP limit) dominates the probability distribution. The GMFL-I phase can be seen in the inset graph, and the final GMFL-II phase is shown in purple. (b) The target component of the average network output. Here singularities can be observed at the $\sigma^2$ values where the phase transitions occur. The parameters used in this calculation are $N=1000$, $P=401$, $\sigma_a^2 = 0.002/N$, $\gamma = 0.0001$.
  • Figure 4: Phase transition to the GMFL-I phase for networks trained on finite-sized data sets. Here we use $N$ to drive the phase transition and observe that, for the smaller values of $N$, two new local minima appear in the action, which become more significant as the width decreases.
  • Figure 5: Dynamic delayed generalization. Here the train and test losses of an ensemble of 50 networks were tracked over the course of $7\cdot 10^6$ epochs, for networks of three different widths. The average loss is shown in color, and the individual networks in gray. At $N=700$ a sharp drop in the test loss can be observed, as is commonly associated with Grokking (nanda2023progress, gromov2023grokking).
  • ...and 1 more figure