Grokking Modular Polynomials
Darshil Doshi, Tianyu He, Aritra Das, Andrey Gromov
TL;DR
Neural networks generalize on only a subset of modular arithmetic tasks. This paper extends the existing analytic solution for 2-layer MLPs on modular addition to modular multiplication and to addition with many terms, using bijective maps on $GF(p)$. It derives explicit weight forms for wide networks, shows that trained networks that grok these tasks learn similar periodic weights, and combines the resulting "expert" solutions in a Mixture-of-Experts style construction that generalizes on arbitrary modular polynomials. It also hypothesizes a classification of modular polynomials into learnable and non-learnable. The work advances understanding of neural generalization in modular reasoning and points toward architecturally specialized solutions for problems relevant to cryptography and number theory.
Abstract
Neural networks readily learn a subset of modular arithmetic tasks, while failing to generalize on the rest. This limitation remains unmoved by the choice of architecture and training strategy. On the other hand, an analytical solution for the weights of multi-layer perceptron (MLP) networks that generalize on the modular addition task is known in the literature. In this work, we (i) extend the class of analytical solutions to include modular multiplication as well as modular addition with many terms, and show that real networks trained on these datasets learn similar solutions upon generalization (grokking); (ii) combine these "expert" solutions to construct networks that generalize on arbitrary modular polynomials; and (iii) hypothesize a classification of modular polynomials into learnable and non-learnable via neural network training, and provide experimental evidence supporting our claims.
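One way to see why a solution for modular addition can carry over to modular multiplication is the discrete logarithm: for a prime $p$ and a generator $g$ of the nonzero residues, every nonzero $n$ can be written as $g^k$, so multiplying residues mod $p$ amounts to adding exponents mod $p - 1$. The sketch below checks this identity numerically. It is an illustration only: the choice $p = 97$, the helper names (`primitive_root`, `log_table`, `mul_via_add`), and the assumption that the discrete logarithm is the relevant bijective map are not taken from the text above.

```python
def primitive_root(p):
    """Return the smallest generator of the nonzero residues mod a prime p (brute force)."""
    for g in range(2, p):
        # g is a generator iff its powers visit all p - 1 nonzero residues
        if len({pow(g, k, p) for k in range(p - 1)}) == p - 1:
            return g
    raise ValueError("no generator found; p should be an odd prime")


def log_table(p, g):
    """Map each nonzero residue n to the exponent k with g**k == n (mod p)."""
    return {pow(g, k, p): k for k in range(p - 1)}


def mul_via_add(a, b, p, g, log):
    """Compute (a * b) mod p using only addition of exponents mod p - 1."""
    a, b = a % p, b % p
    if a == 0 or b == 0:
        # zero has no discrete logarithm, so it is handled separately
        return 0
    return pow(g, (log[a] + log[b]) % (p - 1), p)


p = 97  # hypothetical prime chosen for illustration
g = primitive_root(p)
log = log_table(p, g)

# Exhaustive check: the addition-based rule reproduces multiplication mod p.
assert all(
    mul_via_add(a, b, p, g, log) == (a * b) % p
    for a in range(p)
    for b in range(p)
)
```

Zero is treated as a special case because it is not a power of $g$; a network-level construction would need to handle it separately as well.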
