Provable Benefits of Sinusoidal Activation for Modular Addition
Tianlong Huang, Zhiyuan Li
TL;DR
This work analyzes how the choice of activation affects learning modular addition with a shared embedding, and reveals a sharp expressivity gap that favors sinusoidal activations. It provides generalization bounds in both the underparameterized regime (uniform convergence via the Natarajan dimension) and the overparameterized regime (width-independent margin guarantees), and shows that constant-width sine networks can realize exact modular sums across all lengths, unlike ReLU networks. Experiments with MLPs and Transformers confirm that sine activations yield better sample efficiency, stronger length generalization, and alignment between margins and generalization, with bias further improving robustness on out-of-domain lengths. The findings advocate an explicit periodic inductive bias when the target structure is inherently periodic, which enables compact representations and improved extrapolation across architectures. Overall, the paper connects expressivity, generalization theory, and empirical performance to argue that sinusoidal activations are a provably beneficial design for modular arithmetic and related periodic problems.
Abstract
This paper studies the role of activation functions in learning modular addition with two-layer neural networks. We first establish a sharp expressivity gap: sine MLPs admit width-$2$ exact realizations for any fixed input length $m$ and, with bias, width-$2$ exact realizations uniformly over all lengths. In contrast, the width of ReLU networks must scale linearly with $m$ to interpolate, and they cannot simultaneously fit two lengths with different residues modulo $p$. We then provide a novel Natarajan-dimension generalization bound for sine networks, yielding nearly optimal sample complexity $\widetilde{\mathcal{O}}(p)$ for ERM over constant-width sine networks. We also derive width-independent, margin-based generalization guarantees for sine networks in the overparameterized regime and validate them empirically. Empirically, sine networks generalize consistently better than ReLU networks across regimes and exhibit strong length extrapolation.
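To make the width-$2$ claim concrete, the following is a minimal sketch of one way a sine network with bias can compute modular sums at every input length. It is an illustrative construction and not necessarily the one used in the paper. It assumes a shared scalar embedding $x \mapsto 2\pi x / p$ whose position-wise values are summed before the hidden layer, two hidden units with biases $0$ and $\pi/2$, and output weights $(\sin(2\pi k/p), \cos(2\pi k/p))$ for class $k$. The function names `sine_mlp_logits` and `predict` are hypothetical. Under these assumptions the logit for class $k$ equals $\cos(2\pi(s-k)/p)$, where $s$ is the sum of the input tokens, so the argmax is $s \bmod p$ regardless of $m$.

```python
import numpy as np


def sine_mlp_logits(tokens, p):
    """Logits of an illustrative width-2 sine network with bias (assumed construction)."""
    tokens = np.asarray(tokens)
    # Assumed shared scalar embedding x -> 2*pi*x/p, summed over positions.
    pre = 2.0 * np.pi * tokens.sum(axis=-1, keepdims=True) / p  # shape (..., 1)
    # Two hidden units share the same input weight; biases 0 and pi/2
    # turn them into sin(theta) and cos(theta).
    bias = np.array([0.0, np.pi / 2.0])
    hidden = np.sin(pre + bias)  # shape (..., 2)
    # Output weights for class k: (sin(2*pi*k/p), cos(2*pi*k/p)).
    k = np.arange(p)
    w_out = np.stack([np.sin(2.0 * np.pi * k / p),
                      np.cos(2.0 * np.pi * k / p)], axis=0)  # shape (2, p)
    # logits[..., k] = sin(theta)sin(phi_k) + cos(theta)cos(phi_k) = cos(theta - phi_k)
    return hidden @ w_out


def predict(tokens, p):
    """Predicted residue: the class with the largest logit."""
    return sine_mlp_logits(tokens, p).argmax(axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = 7
    for m in (2, 3, 10):  # the same fixed weights are reused for every length
        x = rng.integers(0, p, size=(5, m))
        print(m, predict(x, p), x.sum(axis=1) % p)
```

The same two hidden units serve every length because the network only sees the angle $2\pi s/p$, which is periodic in $s$ with period $p$. A piecewise-linear activation has no such built-in periodicity, which is consistent with the width lower bound for ReLU networks stated above.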
