Provable Scaling Laws of Feature Emergence from Learning Dynamics of Grokking

Yuandong Tian

TL;DR

The paper introduces Li$_2$, a principled gradient-dynamics framework that decomposes grokking into three stages: lazy learning, independent feature learning, and interactive feature learning. It shows how leaked gradient signals in Stage I trigger Stage II’s energy-driven, nodewise feature emergence, and how Stage III interactions promote diversity and refinement via repulsion and top-down modulation, with Muon accelerating exploration. The framework provides provable scaling laws for when features generalize versus memorize, characterizes local maxima of an energy landscape tied to nonlinear canonical correlation, and extends to deeper architectures. It also explains the role of hyperparameters (weight decay, learning rate, data size, Muon) in shaping grokking and provides a path toward first-principles understanding of feature emergence in structured-input settings. The results unify group-theoretic structure with gradient dynamics to account for efficient feature representations and generalization under data constraints, with practical implications for optimizer design and data-efficiency in structured tasks.

Abstract

While the phenomenon of grokking, i.e., delayed generalization, has been studied extensively, it remains an open problem whether there is a mathematical framework that characterizes what kind of features will emerge, and how and under which conditions this happens, in a way closely tied to the gradient dynamics of training, for complex structured inputs. We propose a novel framework, named $\mathbf{Li}_2$, that captures three key stages of the grokking behavior of 2-layer nonlinear networks: (I) lazy learning, (II) independent feature learning, and (III) interactive feature learning. In the lazy learning stage, the top layer overfits to random hidden representations and the model appears to memorize; at the same time, the backpropagated gradient $G_F$ from the top layer now carries information about the target label, with a specific structure that enables each hidden node to learn its representation independently. Interestingly, the independent dynamics follow exactly the gradient ascent of an energy function $E$, whose local maxima are precisely the emerging features. We study whether these local-optima-induced features are generalizable, their representation power, and how they change with sample size, in group arithmetic tasks. When hidden nodes start to interact in the later stage of learning, we provably show how $G_F$ changes to focus on missing features that need to be learned. Our study sheds light on the roles played by key hyperparameters such as weight decay, learning rate, and sample size in grokking, leads to provable scaling laws of feature emergence, memorization, and generalization, and reveals why recent optimizers such as Muon can be effective, from the first principles of gradient dynamics. Our analysis can be extended to multi-layer architectures. The code is available at https://github.com/yuandong-tian/understanding/tree/main/ssl/real-dataset/cogo.

Paper Structure

This paper contains 48 sections, 36 theorems, 230 equations, 17 figures.

Key Result

Proposition 1

If $\tilde{F}$ is fixed and has full column rank, and the entries of $V(0)$ are initialized from the normal distribution $N(0, \alpha^2)$ with $0<\alpha\ll 1$, then $\|G_F(0)\|_F = O(\epsilon\sqrt{KM})$ and the backpropagated gradient $G_F$ is dominated by the term $\tilde{Y} \tilde{Y}^\top F$ at initial time sta and converges exponentially to the following fixed point when $V = V_{\textrm{ridge}} = (\tilde{F}^
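The statement above is truncated in this listing, so the exact expression for $V_{\textrm{ridge}}$ is not reproduced here. As a hedged illustration of the general idea only, the sketch below checks numerically that gradient descent with weight decay on a fixed feature matrix converges to the textbook ridge-regression solution. The objective $\tfrac12\|FV - Y\|_F^2 + \tfrac{\eta}{2}\|V\|_F^2$, the closed form $(F^\top F + \eta I)^{-1} F^\top Y$, the step size, and the function names are all assumptions made for this example, not the paper's definitions.

```python
import numpy as np

def ridge_closed_form(F, Y, eta):
    """Textbook ridge solution (assumed form): (F^T F + eta I)^{-1} F^T Y."""
    K = F.shape[1]
    return np.linalg.solve(F.T @ F + eta * np.eye(K), F.T @ Y)

def gd_with_weight_decay(F, Y, eta, steps=2000, seed=0):
    """Gradient descent on 0.5*||F V - Y||^2 + 0.5*eta*||V||^2 with F held fixed."""
    rng = np.random.default_rng(seed)
    # Small random initialization of the output weights (hypothetical scale).
    V = 1e-3 * rng.standard_normal((F.shape[1], Y.shape[1]))
    # Step size 1/L, where L is the largest curvature of the quadratic objective.
    lr = 1.0 / (np.linalg.norm(F, 2) ** 2 + eta)
    for _ in range(steps):
        grad = F.T @ (F @ V - Y) + eta * V
        V -= lr * grad
    return V

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    F = rng.standard_normal((50, 10))   # fixed features, full column rank
    Y = rng.standard_normal((50, 3))    # targets
    gap = np.abs(gd_with_weight_decay(F, Y, 0.1) - ridge_closed_form(F, Y, 0.1)).max()
    print("max abs gap to ridge solution:", gap)
```

Because the objective is a strictly convex quadratic, the error contracts by a constant factor per step, which is the discrete-time counterpart of exponential convergence.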

Figures (17)

  • Figure 1: Overview of our framework Li$_2$. Li$_2$ proposes three stages of the learning process, (I) lazy learning, (II) independent feature learning, and (III) interactive feature learning, to explain the dynamics of grokking, in which the network first memorizes and then generalizes (see Fig. 2 for details). Our analysis goes beyond the Neural Tangent Kernel (NTK) and mean-field regimes, and characterizes concretely how features emerge from gradient dynamics with the help of the energy function $\mathcal{E}$ (Thm. 1) and the multiple key factors that affect the process. Specifically, we characterize the learned features as local maxima of $\mathcal{E}$ (Thm. 2) and the sample size required to maintain them (Thm. 4), establishing generalization/memorization scaling laws.
  • Figure 2: Three stages of the Li$_2$ framework. (a) Random weight initialization. (b) Stage I: The model first learns to overfit the data with the random features provided by the hidden layer, while the hidden layer does not change much due to the noisy backpropagated gradient $G_F$. (c) Stage II: Once the output layer overfits the data, $G_F$ becomes related to the target label $\tilde{Y}$ given suitable weight decay $\eta$. Moreover, $G_F$ acts independently on each hidden neuron and pushes each one to learn features following the energy function $\mathcal{E}$ (Thm. 1). (d) Stage III: Once the hidden layer has learned some features, interactions appear (Thm. 6) and the backpropagated gradient $G_F$ now carries information about the residual $\tilde{Y} - \hat{Y}$, pushing the hidden layer to learn the missing features (Thm. \ref{thm:top-down-modulation}).
  • Figure 3: Grokking dynamics on the modular addition task with $M=71$, $K=2048$, $n=2016$ ($40\%$ of the $71^2$ samples used for training), with and without weight decay. Top: $\eta = 0.0002$ and grokking happens. Bottom: $\eta=0$ and no grokking happens. Weight decay leads to larger $|G_F|$ around epoch $100$ and induces grokking behavior. The weight difference $\Delta W$ between consecutive weights at times $t$ and $t+1$, measured by cosine distance, shows two-stage behavior: first there is a large update to the output weight $V$, then a large update to the hidden weight $W$. Throughout training, $\tilde{F}^\top \tilde{F}$ and $P^\perp_1 F F^\top$ remain diagonal up to $8\%$ error, validating our analysis (independent feature learning, Sec. \ref{sec:independent_feature_learning}). Experiments are averaged over $15$ seeds.
  • Figure 4: Change of the landscape of the energy function $\mathcal{E}$ (Thm. 1). Left: With linear activation, $\mathcal{E}$ reduces to a simple eigendecomposition and has only one global maximum. Middle: With nonlinearity, the energy landscape has multiple strict local maxima, each corresponding to a feature (Thm. 2). More importantly, these features are more efficient than memorization in target prediction (Thm. 3). Right: With sufficient training data, the landscape remains stable and we can recover these (generalizable) features (Thm. 4); with insufficient data, the landscape changes substantially and the local maxima become memorization solutions (Thm. 5).
  • Figure 5: Generalization/memorization phase transition in modular addition tasks. As $M$ grows, the training data ratio $p = n / M^2$ required to achieve generalization decreases. This agrees with Thm. 4, which predicts $p \sim M^{-1}\log M$ (dotted line). We use learning rate $0.0005$, weight decay $0.0002$, and $K = 2048$. Results are averaged over 20 seeds. Top Left: Simple cyclic group $\mathbb{Z}_M$ for prime $M$. Top Right: $\mathbb{Z}_M$ for composite $M$. Bottom Left: Product groups $\mathbb{Z}_{4}\otimes \mathbb{Z}_{7}$, $\mathbb{Z}_{5}\otimes \mathbb{Z}_{6}$, $\mathbb{Z}_{2}\otimes \mathbb{Z}_{2} \otimes \mathbb{Z}_{9}$, $\mathbb{Z}_{13}\otimes \mathbb{Z}_{11}$, $\mathbb{Z}_{5}\otimes \mathbb{Z}_{2} \otimes \mathbb{Z}_{2} \otimes \mathbb{Z}_{2}$, $\mathbb{Z}_{6}\otimes \mathbb{Z}_{4} \otimes \mathbb{Z}_{2}$, $\mathbb{Z}_{3}\otimes \mathbb{Z}_{2} \otimes \mathbb{Z}_{17}$, $\mathbb{Z}_{2}\otimes \mathbb{Z}_{3} \otimes \mathbb{Z}_{3} \otimes \mathbb{Z}_5$. Bottom Right: Non-Abelian groups with $\max_k d_k = 2$ (maximal irreducible dimension $2$). These non-Abelian groups are generated from GAP programs (see Appendix Sec. \ref{sec:gap}).
  • ...and 12 more figures
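To make the Figure 3 setup concrete, here is a minimal sketch of how a modular addition dataset with $M=71$ and a $40\%$ training split could be built, together with a cosine distance between consecutive weight matrices. The function names, the uniform random split, and the choice to flatten each weight matrix before comparing are assumptions for illustration; the paper's released code may do this differently.

```python
import numpy as np

def modular_addition_split(M=71, train_frac=0.4, seed=0):
    """Enumerate all (a, b) pairs with label (a + b) mod M and split them at random."""
    a, b = np.meshgrid(np.arange(M), np.arange(M), indexing="ij")
    pairs = np.stack([a.ravel(), b.ravel()], axis=1)      # shape (M^2, 2)
    labels = (pairs[:, 0] + pairs[:, 1]) % M
    n = int(train_frac * M * M)                           # number of training samples
    perm = np.random.default_rng(seed).permutation(M * M)
    train_idx, test_idx = perm[:n], perm[n:]
    return pairs[train_idx], labels[train_idx], pairs[test_idx], labels[test_idx]

def cosine_distance(W_prev, W_next):
    """1 - cosine similarity between two weight matrices, compared as flat vectors."""
    u, v = W_prev.ravel(), W_next.ravel()
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

if __name__ == "__main__":
    X_train, y_train, X_test, y_test = modular_addition_split()
    print("train / test sizes:", len(X_train), len(X_test))
```

With the default arguments the split keeps $\lfloor 0.4 \cdot 71^2 \rfloor = 2016$ pairs for training, matching the $n$ quoted in the caption.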

Theorems & Definitions (56)

  • Proposition 1
  • Lemma 1: Structure of backpropagated gradient $G_F$
  • Theorem 1: The energy function $\mathcal{E}$ for independent feature learning
  • Theorem 2: Local maxima of $\mathcal{E}$ for group input
  • Corollary 1: Flatness of local maxima of $\mathcal{E}$ for group input
  • Corollary 2: Modular addition
  • Theorem 3: Target Reconstruction
  • Theorem 4: Amount of samples to maintain local optima
  • Theorem 5: Memorization solution
  • Theorem 6: Repulsion of similar features
  • ...and 46 more
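The Figure 5 caption ties the sample-size result (Theorem 4) to a predicted training ratio $p \sim M^{-1}\log M$. The short sketch below evaluates that scaling for a few group sizes to show that the ratio shrinks while the implied sample count $n = pM^2 \sim M\log M$ grows. The constant `c`, the use of the natural logarithm, and the function names are assumptions for illustration; the source gives only the asymptotic form.

```python
import math

def predicted_ratio(M, c=1.0):
    """Training ratio p ~ c * log(M) / M (c is a hypothetical constant)."""
    return c * math.log(M) / M

def predicted_samples(M, c=1.0):
    """Implied sample count n = p * M^2, which scales as c * M * log(M)."""
    return predicted_ratio(M, c) * M * M

if __name__ == "__main__":
    for M in (23, 71, 127):
        print(M, predicted_ratio(M), predicted_samples(M))
```

Since $\log M / M$ is decreasing for $M > e$, larger groups need a smaller fraction of the $M^2$ pairs under this prediction, even though the absolute number of samples increases.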