Parity, Sensitivity, and Transformers

Alexander Kozachinskiy; Tomasz Steifer; Przemysław Wałȩga

TL;DR

This work analyzes the expressivity of transformers on the PARITY task, defined as $\text{PARITY}(x_1,\dots,x_n)=x_1\oplus\cdots\oplus x_n$. It first proves a lower bound: any sequence of Boolean functions $f_n$ computable by a 1-layer, 1-head transformer has average sensitivity $as(f_n)=O(\sqrt{n})$, which rules out PARITY because its average sensitivity grows linearly in $n$. It then gives a construction: a 4-layer transformer that computes PARITY with softmax attention, a length-independent and polynomially bounded positional encoding, and no layernorm, and that works under both full attention and causal masking. The construction thereby avoids the impractical features of earlier ones (length-dependent positional encodings, hardmax, layernorm without the regularization parameter, or incompatibility with causal masking). Together, the two results sharpen the picture of which transformers can and cannot compute highly sensitive functions such as PARITY.

Abstract

The transformer architecture is almost a decade old. Despite that, we still have a limited understanding of what this architecture can or cannot compute. For instance, can a 1-layer transformer solve PARITY -- or more generally -- which kinds of transformers can do it? Known constructions for PARITY have at least 2 layers and employ impractical features: either a length-dependent positional encoding, or hardmax, or layernorm without the regularization parameter, or they are not implementable with causal masking. We give a new construction of a transformer for PARITY with softmax, length-independent and polynomially bounded positional encoding, no layernorm, working both with and without causal masking. We also give the first lower bound for transformers solving PARITY -- by showing that it cannot be done with only one layer and one head.

Paper Structure (10 sections, 10 theorems, 42 equations)

Introduction
Our results.
Related work.
Organization of the paper.
Transformers
Attention layers.
Transformers.
Sensitivity Lower Bound
A New Transformer for Parity
Proof of Lemma \ref{lem_crazy}

Key Result

Theorem 1

Assume that a sequence of Boolean functions $\{f_n\}_{n = 1}^\infty$ is computable by a 1-layer 1-head transformer. Then $as(f_n) = O(\sqrt{n})$ as $n\to\infty$.
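To see why this bound excludes PARITY, it helps to compute average sensitivity on small inputs. The sketch below assumes the standard definition, which the summary does not spell out: $as(f)$ is the expected number of coordinates $i$ such that flipping bit $x_i$ changes $f(x)$, for $x$ uniform on $\{0,1\}^n$. The function names and the choice of majority as a contrasting low-sensitivity function are illustrative and not taken from the paper.

```python
from itertools import product


def parity(x):
    """XOR of all bits of the tuple x."""
    return sum(x) % 2


def majority(x):
    """1 if strictly more than half of the bits are 1, else 0."""
    return int(2 * sum(x) > len(x))


def average_sensitivity(f, n):
    """Brute-force as(f): mean number of sensitive coordinates over all 2^n inputs.

    Assumes the standard definition of average sensitivity (not quoted from the paper).
    """
    total = 0
    for x in product((0, 1), repeat=n):
        fx = f(x)
        for i in range(n):
            # Flip the i-th bit and check whether the output changes.
            y = x[:i] + (1 - x[i],) + x[i + 1:]
            total += f(y) != fx
    return total / 2 ** n


for n in (3, 5, 7, 9):
    print(n, average_sensitivity(parity, n), average_sensitivity(majority, n))
```

For PARITY every bit flip changes the output, so the computed value equals $n$ exactly, which exceeds any $O(\sqrt{n})$ bound for large $n$. Majority, by contrast, grows much more slowly, so a sensitivity argument of this kind does not exclude it.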

Theorems & Definitions (21)

Definition 1
Definition 2
Definition 3
Theorem 1
Proof of Theorem \ref{thm:1layer_lower}
Lemma 1
Proof
Corollary 1
Lemma 2
Theorem 2
...and 11 more
