Grokking Modular Polynomials

Darshil Doshi; Tianyu He; Aritra Das; Andrey Gromov

TL;DR

The paper addresses the limited generalization of neural nets on modular arithmetic and extends the existing analytic solution for 2-layer MLPs on modular addition to modular multiplication and multi-term sums, using bijective maps on $GF(p)$. It derives explicit weight forms for wide networks, demonstrates that real networks grok these tasks with similar periodic weights, and proposes an expert-based construction to generalize to arbitrary modular polynomials, hinting at a learnable-vs-nonlearnable taxonomy. The contributions include closed-form weight constructions for addition and multiplication, empirical demonstrations of grokking-aligned weight structure, and a Mixture-of-Experts style framework for generalization beyond standard architectures. This work advances understanding of neural generalization in modular reasoning and points toward architecturally specialized solutions for cryptography- and number-theory-relevant problems.
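The role of the bijective maps can be illustrated with a small sketch. For a prime $p$, the nonzero residues form a cyclic group, so taking discrete logarithms with respect to a primitive root $g$ turns multiplication mod $p$ into addition mod $p-1$. The helper names below (`primitive_root`, `log_table`, `mul_via_logs`) and the brute-force search are illustrative assumptions, not the authors' code. Zero has no logarithm, so it is treated as a special case here.

```python
def primitive_root(p):
    """Return the smallest primitive root of the prime p (brute-force search)."""
    for g in range(2, p):
        # g is a primitive root iff its powers reach all p-1 nonzero residues
        if len({pow(g, k, p) for k in range(p - 1)}) == p - 1:
            return g
    raise ValueError("no primitive root found; expected an odd prime p")


def log_table(p, g):
    """Map each nonzero residue n to the exponent k with g**k = n (mod p)."""
    return {pow(g, k, p): k for k in range(p - 1)}


def mul_via_logs(n1, n2, p, g, logs):
    """Compute n1 * n2 mod p by adding discrete logarithms modulo p - 1."""
    if n1 % p == 0 or n2 % p == 0:
        return 0  # assumption: zero is handled separately, it has no logarithm
    k = (logs[n1 % p] + logs[n2 % p]) % (p - 1)
    return pow(g, k, p)  # exponential map back to a residue


p = 97  # same modulus as the multiplication experiment in Figure 3
g = primitive_root(p)
logs = log_table(p, g)

# The log/exp route agrees with direct multiplication on every input pair.
assert all(
    mul_via_logs(a, b, p, g, logs) == (a * b) % p
    for a in range(p)
    for b in range(p)
)
```

Under this relabeling, a solution for two-term modular addition over $p-1$ residues can be reused for multiplication over the nonzero residues of $GF(p)$, which is the sense in which the addition solution extends.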

Abstract

Neural networks readily learn a subset of the modular arithmetic tasks, while failing to generalize on the rest. This limitation remains unmoved by the choice of architecture and training strategies. On the other hand, an analytical solution for the weights of Multi-layer Perceptron (MLP) networks that generalize on the modular addition task is known in the literature. In this work, we (i) extend the class of analytical solutions to include modular multiplication as well as modular addition with many terms. Additionally, we show that real networks trained on these datasets learn similar solutions upon generalization (grokking). (ii) We combine these "expert" solutions to construct networks that generalize on arbitrary modular polynomials. (iii) We hypothesize a classification of modular polynomials into those that are learnable and those that are non-learnable by neural network training, and provide experimental evidence supporting our claims.

Paper Structure (15 sections, 17 equations, 3 figures)

This paper contains 15 sections, 17 equations, 3 figures.

Introduction
Modular Addition with Many Terms
Analytical solution
Comparison with trained networks
Modular Multiplication
Exponential and logarithmic maps over finite fields
Analytical solution
Comparison with trained networks
Arbitrary Modular Polynomials
Discussion
Analytical solutions
Modular multiplication
Modular addition with many terms
Performance on arbitrary modular polynomials
Training on learnable and non-learnable modular polynomials

Figures (3)

Figure 1: Modular addition with many terms -- analytical solution applied to real 2-layer MLP networks ($p=23$). The solution in equation \ref{eq:solution_multisum} works for sufficiently wide networks. Note that the x-axis is logarithmic, which suggests that the required width grows exponentially with the number of terms (as expected). Accuracies are computed on a randomly chosen subset of 10k examples from the full dataset. The results shown are the best of 10 random seeds.
Figure 2: Training on modular addition with many terms ($(n_1+n_2+n_3+n_4)\, \mathrm{mod} \,p$). $p=11$; $N=5000$; Adam optimizer; learning rate $=0.005$; weight decay $=5.0$; $50\%$ of the dataset used for training. (a) MSE loss on the train and test datasets. (b) Accuracy on the train and test datasets, as well as the average IPR of the network, $\overline{\mathrm{IPR}}$. The training curves show the well-known grokking phenomenon, while $\overline{\mathrm{IPR}}$ increases monotonically. (c) Initial and final IPR distributions, showing periodic neurons in the grokked network and confirming the similarity to equation \ref{eq:solution_multisum}. Note that the IPR of the analytical solution (equation \ref{eq:solution_multisum}) is 1.
Figure 3: Training on modular multiplication ($n_1n_2\, \mathrm{mod} \,p$). $p=97$; $N=500$; Adam optimizer; learning rate $=0.005$; weight decay $=5.0$; $50\%$ of the dataset used for training. (a) MSE loss on the train and test datasets. (b) Accuracy on the train and test datasets, as well as the average IPR of the network, $\overline{\mathrm{IPR}}$. The training curves show the well-known grokking phenomenon, while $\overline{\mathrm{IPR}}$ increases monotonically. (c) Initial and final IPR distributions, showing periodic neurons in the grokked network and confirming the similarity to equation \ref{eq:solution_mul}. Note that the IPR of the analytical solution (equation \ref{eq:solution_mul}) is 1.
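
The captions use the IPR as a measure of how periodic a neuron's weights are, but this page does not define it. The sketch below assumes one common form: the inverse participation ratio of the weight vector's power spectrum, $\sum_k P_k^2 / (\sum_k P_k)^2$, computed over non-negative frequencies with a real FFT. The function name `ipr`, the exponent `r=2`, and the use of `rfft` are assumptions and may differ from the authors' exact normalization. Under these assumptions a single-frequency weight vector has IPR 1, consistent with the captions' remark about the analytical solution, while unstructured weights score much lower.

```python
import numpy as np


def ipr(w, r=2):
    """Inverse participation ratio of a weight vector in Fourier space.

    Assumed definition: sum(P**r) / sum(P)**r, where P is the power
    spectrum over non-negative frequencies (real FFT).
    """
    power = np.abs(np.fft.rfft(w)) ** 2
    return float(np.sum(power**r) / np.sum(power) ** r)


p = 97
n = np.arange(p)

# A single-frequency ("periodic") weight vector with an arbitrary phase.
periodic = np.cos(2 * np.pi * 5 * n / p + 0.3)

# Unstructured weights, standing in for a neuron at random initialization.
rng = np.random.default_rng(0)
noise = rng.normal(size=p)

print("IPR of periodic weights:", ipr(periodic))
print("IPR of random weights:  ", ipr(noise))
```

Averaging this quantity over all neurons would give a network-level score in the spirit of $\overline{\mathrm{IPR}}$, again under the assumed definition.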
