A completely uniform transformer for parity
Alexander Kozachinskiy, Tomasz Steifer
TL;DR
The paper studies recognition of the parity language $L=\{x\in\{0,1\}^* : \sum_i x_i \bmod 2 = 0\}$ by transformers whose parameters and positional encodings are both completely independent of the input length. It introduces a completely uniform 3-layer transformer with constant embedding dimension, no length-dependent positional encoding, and no masking or layer norm, improving on prior work whose positional encoding depends on the input length. The construction uses a two-layer subroutine to generate a sequence $a_1,\dots,a_n$ that is dominated by the entry $a_\Sigma$, where $\Sigma=\sum_i x_i$ is the number of ones in the input. A final layer then computes the softmax-weighted parity estimator $\frac{\sum_i e^{a_i f(n)}(-1)^i}{\sum_i e^{a_i f(n)}}$, which converges to $(-1)^{\Sigma}$ as $f(n)$ grows. A key lemma ensures that the sequence has a unique maximizer at $i=\Sigma$ (with $\alpha=1/100$), and the method includes a correction for the edge case $\Sigma=0$. Together these establish that parity can be recognized by a completely uniform transformer with constant embedding dimension.
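To illustrate why the softmax-weighted estimator above approaches $(-1)^{\Sigma}$, here is a minimal numerical sketch. The function names, the quadratic scores in `toy_scores`, and the choice $f(n)=n^3$ are assumptions made only for this demonstration; they are not the paper's two-layer subroutine, and they share with it only the property that the sequence has a unique maximum at $i=\Sigma$.

```python
import math


def parity_estimate(a, scale):
    """Softmax-weighted average of (-1)^i with weights exp(a_i * scale), i = 1..n."""
    m = max(a)
    # Subtracting the maximum leaves the ratio unchanged and avoids overflow.
    weights = [math.exp((ai - m) * scale) for ai in a]
    numerator = sum(w * (-1) ** i for i, w in enumerate(weights, start=1))
    return numerator / sum(weights)


def toy_scores(bits):
    """Hypothetical scores with a unique maximum at i = Sigma (assumes Sigma >= 1).

    This is a stand-in for illustration, not the sequence built in the paper.
    """
    n = len(bits)
    sigma = sum(bits)
    return [-(((i - sigma) / n) ** 2) for i in range(1, n + 1)]


if __name__ == "__main__":
    for bits in ([1, 0, 1, 1, 0, 0, 1, 0], [1, 1, 1, 0, 0, 0, 0, 0]):
        n = len(bits)
        estimate = parity_estimate(toy_scores(bits), n ** 3)  # assumed f(n) = n^3
        print(sum(bits), estimate)
```

With these toy scores the gap between $a_\Sigma$ and its neighbours is $1/n^2$, so a scale of $n^3$ suppresses every other term by a factor of at least $e^{-n}$, and the estimate lands close to $+1$ for an even number of ones and close to $-1$ for an odd number. The sketch does not handle $\Sigma=0$, since no index $i=0$ exists to carry the maximum; this is the edge case for which the summary mentions a separate correction.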
Abstract
We construct a 3-layer constant-dimension transformer that recognizes the parity language and in which neither the parameter matrices nor the positional encoding depend on the input length. This improves upon a construction of Chiang and Cholak, whose positional encoding depends on the input length (although their construction has only 2 layers).
