On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking

Jianliang He, Leda Wang, Siyu Chen, Zhuoran Yang

TL;DR

The paper analyzes how two-layer networks solve modular addition by learning Fourier features, and shows that the trained neurons form a diversified, phase-aligned representation. It formalizes a diversification condition combining frequency diversification and phase symmetry, enabling a majority-voting mechanism that aggregates biased per-neuron signals into the correct modular sum. A lottery-ticket perspective explains how initial spectral magnitudes and phase misalignments determine the winning frequency for each neuron, with a formal ODE-based analysis of gradient flow supporting this view. The grokking phenomenon is characterized as a three-stage process driven by the competition between loss minimization and weight decay, transitioning from memorization to two generalization phases that prune non-feature components and yield sparse Fourier representations. Together, these results provide a principled, mechanistic account of feature learning and generalization dynamics in simple neural networks.

Abstract

We present a comprehensive analysis of how two-layer neural networks learn features to solve the modular addition task. Our work provides a full mechanistic interpretation of the learned model and a theoretical explanation of its training dynamics. While prior work has identified that individual neurons learn single-frequency Fourier features and phase alignment, it does not fully explain how these features combine into a global solution. We bridge this gap by formalizing a diversification condition that emerges during training when the network is overparametrized, consisting of two parts: phase symmetry and frequency diversification. We prove that these properties allow the network to collectively approximate a flawed indicator function on the correct logic for the modular addition task. While individual neurons produce noisy signals, the phase symmetry enables a majority-voting scheme that cancels out noise, allowing the network to robustly identify the correct sum. Furthermore, we explain the emergence of these features under random initialization via a lottery ticket mechanism. Our gradient flow analysis proves that frequencies compete within each neuron, with the "winner" determined by its initial spectral magnitude and phase alignment. From a technical standpoint, we provide a rigorous characterization of the layer-wise phase coupling dynamics and formalize the competitive landscape using the ODE comparison lemma. Finally, we use these insights to demystify grokking, characterizing it as a three-stage process involving memorization followed by two generalization phases, driven by the competition between loss minimization and weight decay.
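
The voting mechanism described in the abstract can be illustrated with a minimal NumPy sketch. Several choices here are assumptions made for illustration and are not taken from the paper: a quadratic activation, unit magnitudes, eight evenly spaced phases per frequency (one simple way to obtain a symmetric set of phases), and the sign convention in which neuron $m$ with frequency $k$ has input weights $\cos(2\pi k x/p+\phi_m)$ and output weights $\cos(2\pi k j/p+\psi_m)$ with $\psi_m=2\phi_m$. The function names `build_network` and `logits` are hypothetical.

```python
import numpy as np

def build_network(p, n_phases=8):
    """Hand-built two-layer weights (illustrative assumption, not trained).

    One neuron per (frequency, phase) pair: every frequency k = 1..(p-1)/2
    is covered, and each frequency gets n_phases evenly spaced phases.
    """
    ks = np.arange(1, (p - 1) // 2 + 1)
    phases = 2 * np.pi * np.arange(n_phases) / n_phases
    k_m = np.repeat(ks, n_phases)        # frequency of neuron m
    phi_m = np.tile(phases, len(ks))     # input-layer phase of neuron m
    psi_m = 2 * phi_m                    # aligned output phase: 2*phi - psi = 0
    theta = 2 * np.pi * np.outer(k_m, np.arange(p)) / p   # shape (M, p)
    W_in = np.cos(theta + phi_m[:, None])    # shared by inputs x and y
    W_out = np.cos(theta + psi_m[:, None])   # neuron m -> logit j
    return W_in, W_out

def logits(W_in, W_out, x, y):
    """Logits for the pair (x, y) with a quadratic activation (assumption)."""
    h = W_in[:, x] + W_in[:, y]   # pre-activation from one-hot x and one-hot y
    return (h ** 2) @ W_out       # sum the per-neuron votes for each output j

p = 23
W_in, W_out = build_network(p)
correct = all(
    int(np.argmax(logits(W_in, W_out, x, y))) == (x + y) % p
    for x in range(p) for y in range(p)
)
print("all pairs classified correctly:", correct)
```

In this sketch with $p=23$ and $x\neq y$, the logit at $j=(x+y)\bmod p$ equals 42, smaller bumps of 19 remain at $j=2x$ and $j=2y \bmod p$, and every other logit equals $-4$. The summed output is therefore an imperfect indicator whose argmax is still the correct sum. Whether this matches the paper's "flawed indicator" exactly is not confirmed by the text above.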

Paper Structure (73 sections, 18 theorems, 203 equations, 25 figures, 3 tables)

Key Result

Proposition 4.2

Suppose that the neurons are completely diversified as per Definition 4.1. Under the parametrization in eq:trig_pattern and the phase-alignment condition $2\phi_m-\psi_m=0\bmod{2\pi}$ for all $m\in[M]$, the output logit at dimension $j\in[p]$ takes the form: … For any $\epsilon\in(0,1)$, by taking $a\gtrsim(Np)^{-1}\cdot\log(p/\epsilon)$, it holds that $\|{\mathtt{smax}}\circ f(\cdot$ …
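
The logarithmic dependence on $p/\epsilon$ in the scale condition can be checked numerically on idealized logits. The sketch below assumes a logit vector in which the correct entry exceeds all others by a common gap; this pattern, the choice of the $\ell_\infty$ norm, and the helper names `softmax` and `one_hot_error` are illustrative assumptions, since the proposition's exact statement is truncated above. If the gap grows in proportion to $aNp$, as the stated scaling suggests, then a gap of $\log(p/\epsilon)$ corresponds to $a\gtrsim(Np)^{-1}\log(p/\epsilon)$.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def one_hot_error(gap, p):
    """l-infinity distance between softmax(logits) and the one-hot target.

    Idealized logits (assumption): the correct entry equals `gap`,
    the remaining p - 1 entries equal 0.
    """
    z = np.zeros(p)
    z[0] = gap
    target = np.zeros(p)
    target[0] = 1.0
    return float(np.max(np.abs(softmax(z) - target)))

# A gap of log(p / eps) is enough to bring the error below eps:
# error = (p-1)e^{-gap} / (1 + (p-1)e^{-gap}) < p e^{-gap} = eps.
for p in (23, 97):
    for eps in (1e-1, 1e-3):
        print(p, eps, one_hot_error(np.log(p / eps), p))
```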

Figures (25)

  • Figure 1: An illustration of the primary analytical technique and results. Discrete Fourier Transform (DFT) is utilized to quantitatively interpret the mechanism of learned models within the feature space, revealing the training dynamics that result in consistent feature learning. Panel (a) shows the neural network architecture: we adopt a two-layer fully connected neural network to learn the modular addition task. The inputs $x$ and $y$ are represented as one-hot vectors in $\mathbb{R}^p$, $\sigma(\cdot)$ denotes the activation function, and the width of the neural network is denoted by $M$. Panel (b) illustrates the technique of DFT. We apply DFT to the weights at the input and output layers, respectively. Each neuron involves two weight vectors, which lead to two magnitudes and phases. (See Observation 1 in the experiments section.) Panel (c) illustrates some of our key empirical observations: phase alignment (Observation 2), phase symmetry (Observation 3), and lottery ticket mechanism (Observation 6).
  • Figure 2: Heatmap of Learned Parameters.
  • Figure 3: Actual Learned and Fitted Parameters of Each Neuron.
  • Figure 5: Scatter of $(2\phi_m,\psi_m)$.
  • Figure 6: Phase Symmetry within Frequency Group $\mathcal{N}_k$.
  • ...and 20 more figures
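
The DFT step described in the caption of Figure 1(b) can be sketched as follows: given one weight vector of length $p$, take its DFT and read off the dominant frequency together with its magnitude and phase. This is a minimal sketch using `np.fft.fft`; the function name `dominant_fourier_feature`, the normalization, and the convention $w_j = A\cos(2\pi k j/p+\phi)$ are assumptions for illustration and may differ from the paper's definitions.

```python
import numpy as np

def dominant_fourier_feature(w):
    """Return (k, magnitude, phase) of the strongest nonzero frequency in w.

    Assumes w has length p and is close to A * cos(2*pi*k*j/p + phi)
    for j = 0..p-1. For a real vector the spectrum is conjugate-symmetric,
    so only frequencies 1..(p-1)//2 are inspected.
    """
    p = len(w)
    spectrum = np.fft.fft(w)
    half = np.arange(1, (p - 1) // 2 + 1)
    k = int(half[np.argmax(np.abs(spectrum[half]))])
    magnitude = 2.0 * np.abs(spectrum[k]) / p      # recovers A
    phase = np.angle(spectrum[k]) % (2 * np.pi)    # recovers phi in [0, 2*pi)
    return k, float(magnitude), float(phase)

# Synthetic single-frequency weight vector (illustrative values).
p, k_true, A, phi = 23, 5, 1.7, 0.9
w = A * np.cos(2 * np.pi * k_true * np.arange(p) / p + phi)
print(dominant_fourier_feature(w))
```

Applying the same function to a neuron's input-layer and output-layer weight vectors gives the two magnitudes and two phases per neuron that the caption mentions.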

Theorems & Definitions (39)

  • Definition 4.1: Full Diversification
  • Proposition 4.2
  • Remark 5.1: Equivalence to Margin Maximization under Small Initialization
  • Theorem 5.2: Informal
  • Theorem 5.3
  • Corollary 6.1
  • Definition 6.2: Frequency Multiplication
  • Proposition 6.3
  • Lemma B.1
  • Proof of Lemma (lem:softmax_gap)
  • ...and 29 more