Parity, Sensitivity, and Transformers
Alexander Kozachinskiy, Tomasz Steifer, Przemysław Wałęga
TL;DR
This work studies which transformers can compute PARITY, defined as $\text{PARITY}(x_1,\dots,x_n)=x_1\oplus\cdots\oplus x_n$. On the negative side, it shows that 1-layer, 1-head models cannot compute it: the functions $f_n$ they compute satisfy the average-sensitivity bound $\mathrm{as}(f_n)=O(\sqrt{n})$, whereas the average sensitivity of PARITY grows linearly in $n$. On the positive side, it gives a 4-layer transformer that computes PARITY using softmax attention, a length-independent and polynomially bounded positional encoding, and no layernorm, and that works both with full attention and with causal masking. Together, the results give a lower bound for the smallest architecture and a construction that avoids the impractical features of earlier ones, which sharpens the theoretical picture of transformer expressivity on high-sensitivity functions such as PARITY.
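The sensitivity argument can be checked by brute force for small $n$. The sketch below is illustrative only and is not taken from the paper: the helper names (`parity`, `sensitivity_at`, `average_sensitivity`, `majority`) are hypothetical, and average sensitivity is assumed to mean the expected number of single-bit flips that change the output, taken over uniformly random inputs. MAJORITY is included only as a familiar comparison function with much lower average sensitivity.

```python
from itertools import product


def parity(bits):
    """XOR of all bits: 1 iff the number of ones is odd."""
    result = 0
    for b in bits:
        result ^= b
    return result


def majority(bits):
    """1 iff strictly more than half of the bits are ones (comparison only)."""
    return int(2 * sum(bits) > len(bits))


def sensitivity_at(f, x):
    """Number of positions i where flipping x[i] changes f(x)."""
    fx = f(x)
    count = 0
    for i in range(len(x)):
        flipped = x[:i] + (1 - x[i],) + x[i + 1:]
        if f(flipped) != fx:
            count += 1
    return count


def average_sensitivity(f, n):
    """Mean of sensitivity_at(f, x) over all 2**n inputs (exponential cost)."""
    total = sum(sensitivity_at(f, x) for x in product((0, 1), repeat=n))
    return total / 2 ** n


if __name__ == "__main__":
    for n in (3, 5, 7, 9):
        print(n, average_sensitivity(parity, n), average_sensitivity(majority, n))
```

For PARITY every bit flip changes the output, so the computed value equals $n$ exactly, which is the linear growth that an $O(\sqrt{n})$ bound excludes for large $n$.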
Abstract
The transformer architecture is almost a decade old. Despite that, we still have a limited understanding of what this architecture can or cannot compute. For instance, can a 1-layer transformer solve PARITY -- or more generally -- which kinds of transformers can do it? Known constructions for PARITY have at least 2 layers and employ impractical features: either a length-dependent positional encoding, or hardmax, or layernorm without the regularization parameter, or they are not implementable with causal masking. We give a new construction of a transformer for PARITY with softmax, length-independent and polynomially bounded positional encoding, no layernorm, working both with and without causal masking. We also give the first lower bound for transformers solving PARITY -- by showing that it cannot be done with only one layer and one head.
