A completely uniform transformer for parity

Alexander Kozachinskiy; Tomasz Steifer

TL;DR

The paper tackles recognizing the parity language $L=\{x\in\{0,1\}^* : \sum_i x_i \bmod 2 = 0\}$ with transformers under the constraint that neither the parameters nor the positional encoding depend on the input length. It introduces a completely uniform 3-layer transformer with constant embedding dimension, no length-dependent positional encoding, and no masking or layer norm, improving on prior work that required positional encodings tied to the input length. The construction uses a two-layer subroutine to generate a sequence $a_1,\dots,a_n$ whose dominant entry is $a_\Sigma$, and a final layer that computes the softmax-weighted parity estimator $\frac{\sum_i e^{a_i f(n)}(-1)^i}{\sum_i e^{a_i f(n)}}$, which converges to $(-1)^{\Sigma}$ as $f(n)$ grows. A key lemma ensures a unique maximizer at $i=\Sigma$ (with $\alpha=1/100$), and the method includes a correction for the edge case $\Sigma=0$. Together these establish that parity can be recognized by a completely uniform transformer with constant embedding dimension.
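The behavior of the softmax-weighted estimator can be checked numerically. The following is a minimal sketch, not the paper's construction: it assumes $\Sigma$ denotes the number of ones in the input, and it replaces the sequence produced by the two-layer subroutine with a hypothetical stand-in, `toy_sequence`, whose only relevant property is a unique maximum at index $\Sigma$. The function names and the `scale` argument (playing the role of $f(n)$) are illustrative choices.

```python
import math

def softmax_parity_estimate(a, scale):
    """Return sum_i exp(scale*a_i) * (-1)^i / sum_i exp(scale*a_i), for i = 1..n."""
    m = max(a)  # subtract the maximum before exponentiating, for numerical stability
    weights = [math.exp(scale * (a_i - m)) for a_i in a]
    signed = sum(w * (-1) ** i for i, w in enumerate(weights, start=1))
    return signed / sum(weights)

def toy_sequence(n, peak):
    """Hypothetical stand-in for a_1..a_n: its unique maximum sits at index `peak`."""
    return [-abs(i - peak) / n for i in range(1, n + 1)]

bits = [1, 0, 1, 1, 0, 1, 0, 0]
sigma = sum(bits)                    # number of ones (assumed meaning of Sigma)
a = toy_sequence(len(bits), sigma)   # maximum at index sigma

# As the scale grows, the softmax concentrates on index sigma,
# so the estimate approaches (-1) ** sigma.
for scale in (1, 10, 100, 1000):
    print(scale, softmax_parity_estimate(a, scale))
```

The sketch does not cover $\Sigma=0$: no index in $1,\dots,n$ equals zero, which is why the summary mentions a separate correction for that case.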

Abstract

We construct a 3-layer constant-dimension transformer that recognizes the parity language, where neither the parameter matrices nor the positional encoding depend on the input length. This improves upon a construction of Chiang and Cholak, who use a positional encoding that depends on the input length (although their construction has only 2 layers).

Paper Structure (3 sections, 3 theorems, 19 equations)

Introduction
Preliminaries
Construction

Key Result

Lemma 1

For any function $f\colon\mathbb{N}\to\mathbb{R}$, there exists a completely uniform transformer that, for any input length $n$, computes $f(n)$ in every position in one layer.

Theorems & Definitions (7)

Definition 1
Lemma 1
proof
Theorem 1
proof
Lemma 2
proof: Proof of Lemma \ref{super_lemma}
