A completely uniform transformer for parity

Alexander Kozachinskiy, Tomasz Steifer

TL;DR

The paper tackles recognizing the parity language $L=\{x\in\{0,1\}^* : \sum_i x_i \bmod 2 = 0\}$ with transformers under the constraint that neither the parameters nor the positional encoding depend on the input length. It introduces a completely uniform 3-layer transformer with constant embedding dimension, no length-dependent positional encoding, and no masking or layer norm, improving on prior work that required a length-dependent encoding. The construction uses a two-layer subroutine to generate a sequence $a_1,\dots,a_n$ in which $a_\Sigma$ is the dominant entry, and a final layer that computes the softmax-weighted parity estimator $\frac{\sum_i e^{a_i f(n)}(-1)^i}{\sum_i e^{a_i f(n)}}$, which converges to $(-1)^{\Sigma}$ as $f(n)$ grows. A key lemma ensures a unique maximizer at $i=\Sigma$ (with $\alpha=1/100$), and the method includes a correction to handle the edge case $\Sigma=0$. Together these establish that parity can be recognized by a completely uniform transformer with constant embedding dimension.
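The behaviour of the softmax-weighted estimator can be checked numerically. The sketch below is illustrative only and does not reproduce the paper's construction. It assumes that $\Sigma$ is the number of ones in the input, uses a hypothetical stand-in sequence $a_i = -|i-\Sigma|/n$ whose unique maximum is at $i=\Sigma$, and leaves the choice of $f$ to the caller. The names `parity_estimator`, `bits`, and `f` are hypothetical, and the edge case $\Sigma=0$ is rejected rather than corrected.

```python
import math


def parity_estimator(bits, f):
    """Evaluate sum_i e^{a_i f(n)} (-1)^i / sum_i e^{a_i f(n)} for i = 1..n.

    Assumption: the scores a_i = -|i - sigma| / n are a stand-in with a
    unique maximum at i = sigma; they are NOT the sequence built in the paper.
    """
    n = len(bits)
    sigma = sum(bits)  # assumed meaning of Sigma: number of ones
    if sigma == 0:
        # The paper handles this case with a separate correction,
        # which this sketch does not model.
        raise ValueError("sigma = 0 is not handled in this sketch")
    scale = f(n)
    # Hypothetical scores; the maximum value 0 is attained only at i = sigma,
    # so every exponent below is <= 0 and exp cannot overflow.
    scores = [-abs(i - sigma) / n for i in range(1, n + 1)]
    weights = [math.exp(a * scale) for a in scores]
    numerator = sum(w * (-1) ** i for i, w in enumerate(weights, start=1))
    return numerator / sum(weights)


even_input = [1, 0, 1, 1, 0, 0, 1, 0]  # four ones: even parity
odd_input = [1, 0, 0, 0, 0, 0, 0, 0]   # one 1: odd parity

sharp_even = parity_estimator(even_input, lambda n: n * n)
sharp_odd = parity_estimator(odd_input, lambda n: n * n)
flat_even = parity_estimator(even_input, lambda n: 1.0)
```

With the fast-growing choice $f(n)=n^2$, the weight at $i=\Sigma$ dominates and the estimate lands close to $(-1)^{\Sigma}$. With a constant $f$, the weights stay nearly uniform and the estimate is far from $\pm 1$, which is why the scaling factor has to grow with $n$.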

Abstract

We construct a 3-layer constant-dimension transformer that recognizes the parity language, in which neither the parameter matrices nor the positional encoding depend on the input length. This improves upon a construction of Chiang and Cholak, who use a positional encoding that depends on the input length (although their construction has only 2 layers).

Paper Structure

This paper contains 3 sections, 3 theorems, and 19 equations.

Key Result

Lemma 1

For any function $f\colon\mathbb{N}\to\mathbb{R}$, there exists a completely uniform transformer that, for any input length $n$, computes $f(n)$ in every position in one layer.

Theorems & Definitions (7)

  • Definition 1
  • Lemma 1
  • proof
  • Theorem 1
  • proof
  • Lemma 2
  • proof: Proof of Lemma 2