Provable Benefits of Sinusoidal Activation for Modular Addition

Tianlong Huang, Zhiyuan Li

TL;DR

This work analyzes how the choice of activation affects learning modular addition with a shared embedding, revealing a sharp expressivity gap that favors sinusoidal activations. It provides generalization bounds in both the underparameterized and overparameterized regimes, via Natarajan-dimension-based uniform convergence and width-independent margin guarantees respectively, and shows that constant-width sine networks (with bias) can realize exact modular sums across all lengths, unlike ReLU networks. Empirical results across MLPs and Transformers indicate that sine activations yield better sample efficiency and stronger length generalization, and that generalization tracks the margin, with bias further improving robustness on out-of-domain lengths. The findings advocate an explicit periodic inductive bias when the target structure is inherently periodic, enabling compact representations and improved extrapolation across architectures. Overall, the paper connects expressivity, generalization theory, and empirical performance to argue that sinusoidal activations are a provably beneficial design for modular arithmetic tasks and related periodic problems.

Abstract

This paper studies the role of activation functions in learning modular addition with two-layer neural networks. We first establish a sharp expressivity gap: sine MLPs admit width-$2$ exact realizations for any fixed length $m$ and, with bias, width-$2$ exact realizations uniformly over all lengths. In contrast, the width of ReLU networks must scale linearly with $m$ to interpolate, and they cannot simultaneously fit two lengths with different residues modulo $p$. We then provide a novel Natarajan-dimension generalization bound for sine networks, yielding nearly optimal sample complexity $\widetilde{\mathcal{O}}(p)$ for ERM over constant-width sine networks. We also derive a width-independent, margin-based generalization bound for sine networks in the overparameterized regime and validate it empirically. Empirically, sine networks generalize consistently better than ReLU networks across regimes and exhibit strong length extrapolation.
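The uniform-in-length claim for sine networks with bias can be illustrated with a small sketch. It is not taken from the paper's proof: it assumes the input is the count vector of the tokens (so the first layer sums one shared embedding column per token), that the bias sits on the first layer, and that the prediction is the argmax over $p$ output logits. The names `sine_net_with_bias` and `predict_sum_mod_p` are hypothetical.

```python
import math


def sine_net_with_bias(p):
    """Build (W, b, V) for a width-2 network s(x) = V sin(W x + b).

    Assumed construction (illustrative only): both hidden units carry the
    phase 2*pi*s/p for token s, and the bias pi/2 on the second unit turns
    its sine into a cosine, independently of the sequence length.
    """
    omega = 2 * math.pi / p
    W = [[omega * s for s in range(p)] for _ in range(2)]  # shape 2 x p
    b = [0.0, math.pi / 2]
    # Row k of V reads out sin(t)sin(omega k) + cos(t)cos(omega k) = cos(t - omega k).
    V = [[math.sin(omega * k), math.cos(omega * k)] for k in range(p)]
    return W, b, V


def predict_sum_mod_p(tokens, p):
    """Predict (sum of tokens) mod p with the width-2 sine network above."""
    W, b, V = sine_net_with_bias(p)
    # With a count-vector input, W x is the sum of the columns of the tokens.
    hidden = [math.sin(sum(W[j][s] for s in tokens) + b[j]) for j in range(2)]
    logits = [V[k][0] * hidden[0] + V[k][1] * hidden[1] for k in range(p)]
    return max(range(p), key=logits.__getitem__)
```

Because the logit for class $k$ equals $\cos\bigl(2\pi(S-k)/p\bigr)$ with $S=\sum_i s_i$, the same parameters give the correct residue for any number of tokens, which is the behavior the abstract attributes to sine networks with bias.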

Paper Structure

This paper contains 78 sections, 46 theorems, 243 equations, 15 figures, 1 table.

Key Result

Theorem 4.1

For any fixed $m\ge 2$ and $p\ge 2$, there exists a width-$2$ sine network $s^\theta(x)=V\,\sin(Wx)$ that exactly realizes $Y \equiv \Bigl(\sum_{i=1}^m s_i\Bigr)\!\!\pmod p$ for all $x=(s_1,\cdots,s_m)\in\mathcal{X}_m$, i.e., $\mathbb{P}_{(X,Y)\sim \mathcal{D}_m}\![h_\theta(X)=Y]=1.$
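One way to see why width $2$ suffices without bias at a fixed length is the following sketch. It is an illustrative construction under stated assumptions, not necessarily the one used in the paper's proof: the input $x$ is taken to be the count vector of the tokens $(s_1,\ldots,s_m)$, so $Wx$ is a sum of shared embedding columns, and $h_\theta$ is taken to be the argmax of the output logits. The function names are hypothetical.

```python
import itertools
import math


def build_sine_network(p, m):
    """Return (W, V) for a bias-free width-2 network s(x) = V sin(W x)."""
    omega = 2 * math.pi / p
    # W is 2 x p; column s is the embedding of token s.
    # Row 0 accumulates the phase omega * S, where S is the token sum.
    # Row 1 adds pi/(2m) per token, so exactly m tokens add pi/2 in total
    # and sin(omega * S + pi/2) = cos(omega * S). This is why m is fixed.
    W = [
        [omega * s for s in range(p)],
        [omega * s + math.pi / (2 * m) for s in range(p)],
    ]
    # V is p x 2; row k produces the logit cos(omega * (S - k)).
    V = [[math.sin(omega * k), math.cos(omega * k)] for k in range(p)]
    return W, V


def predict(W, V, tokens):
    """Argmax prediction for a token tuple, using a count-vector input."""
    pre = [sum(row[s] for s in tokens) for row in W]
    hidden = [math.sin(z) for z in pre]
    logits = [v[0] * hidden[0] + v[1] * hidden[1] for v in V]
    return max(range(len(V)), key=logits.__getitem__)


def check_exact(p, m):
    """Exhaustively compare the network with (sum of tokens) mod p."""
    W, V = build_sine_network(p, m)
    return all(
        predict(W, V, t) == sum(t) % p
        for t in itertools.product(range(p), repeat=m)
    )
```

Calling `check_exact(p, m)` enumerates all $p^m$ inputs, so it is only practical for small $p$ and $m$. Evaluating a network built for length $m$ on a different length shifts the second hidden unit's phase away from $\pi/2$, which is consistent with the theorem being stated per fixed length.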

Figures (15)

  • Figure 1: Accuracies for two-layer sine and ReLU MLPs in the underparameterized regime.
  • Figure 2: Two-layer sine networks in the overparameterized regime. Clockwise from top left: layer norm, normalized margin, test accuracy, and standard deviation of test accuracy.
  • Figure 3: Two-layer ReLU networks in the overparameterized regime (panels as in Fig. 2).
  • Figure 4: Out-of-domain accuracies of two-layer sine and ReLU MLPs, with no bias; each heatmap cell reports accuracy under the paper's reporting conventions.
  • Figure 5: Out-of-domain accuracies for two-layer sine MLPs with and without first-layer bias.
  • ...and 10 more figures

Theorems & Definitions (101)

  • Theorem 4.1: Exact realization at fixed length by a width-$2$ sine network
  • Theorem 4.2: Uniform-in-length expressivity of sine networks
  • Theorem 4.3: Width lower bound for modular addition with ReLU networks
  • Theorem 4.4: Impossibility of exact realization at two incongruent lengths for ReLU networks
  • Example 4.5: Lemma 7.2 of Anthony and Bartlett (2009); see also the appendix (label tb:sine-disc-unbounded)
  • Definition 5.1: VC-dimension
  • Definition 5.2: Natarajan-dimension
  • Definition 5.3: Piecewise-polynomial activation
  • Definition 5.4: Trigonometric-polynomial activation
  • Definition 5.5: Polynomial–rational–exponential activation
  • ...and 91 more