Masked Hard-Attention Transformers Recognize Exactly the Star-Free Languages

Andy Yang; David Chiang; Dana Angluin

TL;DR

This work gives exact characterizations of the expressivity of transformers with hard attention and masking by establishing an equivalence with linear temporal logic (LTL) and the class of star-free regular languages, using B-RASP as a convenient intermediate representation. By proving mutual translations between B-RASP, masked hard-attention transformers, and LTL, the authors transfer decades of logical and automata-theoretic results to transformer architectures, which makes it possible to analyze the effect of depth and of position embeddings. Key contributions include a depth-preserving correspondence, a depth hierarchy showing that expressive power increases with the number of layers, and precise results for variant settings such as strict versus non-strict masking and sinusoidal versus arbitrary position embeddings. The findings clarify the limits and capabilities of transformer-like models on formal-language tasks, with implications for architectural design and for the theoretical understanding of expressivity in sequence modeling.

Abstract

The expressive power of transformers over inputs of unbounded size can be studied through their ability to recognize classes of formal languages. In this paper, we establish exact characterizations of transformers with hard attention (in which all attention is focused on exactly one position) and attention masking (in which each position only attends to positions on one side). With strict masking (each position cannot attend to itself) and without position embeddings, these transformers are expressively equivalent to linear temporal logic (LTL), which defines exactly the star-free languages. A key technique is the use of Boolean RASP as a convenient intermediate language between transformers and LTL. We then take numerous results known for LTL and apply them to transformers, showing how position embeddings, strict masking, and depth all increase expressive power.
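The two restrictions named in the abstract can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the paper's construction: the function name `masked_hard_attention`, the use of a plain score matrix, the choice of left-side (past) masking, and the leftmost tie-breaking rule are all hypothetical choices made here for the example.

```python
from typing import List, Optional


def masked_hard_attention(
    scores: List[List[float]],
    strict: bool = True,
) -> List[Optional[int]]:
    """For each query position i, return the single position j it attends to.

    Masking: position i may only look to its left
    (j < i when strict, j <= i when non-strict).
    Hard attention: all attention goes to one maximal-score position.
    Assumption: ties are broken toward the leftmost position.
    Returns None for a position whose mask leaves nothing to attend to.
    """
    attended: List[Optional[int]] = []
    for i in range(len(scores)):
        # Number of visible positions: 0..i-1 (strict) or 0..i (non-strict).
        limit = i if strict else i + 1
        if limit == 0:
            attended.append(None)
            continue
        # Highest score wins; among equal scores, the smallest j wins.
        best = max(range(limit), key=lambda j: (scores[i][j], -j))
        attended.append(best)
    return attended


scores = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [2.0, 2.0, 5.0],
]
strict_result = masked_hard_attention(scores, strict=True)
non_strict_result = masked_hard_attention(scores, strict=False)
```

With strict masking the first position has no visible position at all, so any concrete model needs some convention for that case; with non-strict masking every position can at least attend to itself.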

Paper Structure (38 sections, 29 theorems, 29 equations, 5 figures)

Introduction
Background
Preliminaries
Transformer variants
Previous work
Boolean RASP
Definition
Example: Dyck-1 of depth 2
Additional B-RASP Example: Associative Recall
Normal forms
Proofs for the Boolean RASP section
Unary value predicate
Unary score predicate
Equivalence with linear temporal logic
Proof of the LTL to B-RASP theorem
...and 23 more sections

Key Result

Proposition 1

Every B-RASP program is equivalent to one in which all value predicates $V(i,j)$ depend only on $j$.

Figures (5)

Figure 1: Overview of results in this paper. One-way arrows denote strict inclusion; two-way arrows denote equivalence. PE = position embedding.
Figure 2: DFA recognizing $L_{1,2}$.
Figure 3: Boolean vectors for membership of string $\ell\ell rr\ell\ell r\ell rr$ in $L_{1,2}$.
Figure 4: Boolean vectors for non-membership of string $\ell rr\ell\ell\ell rrr\ell$ in $L_{1,2}$.
Figure 6: Example automaton and its cascade decomposition.

Theorems & Definitions (29)

Proposition 1
Lemma 2
Theorem 3
Lemma 4
Theorem 5
Lemma 6
Theorem 7
Theorem 8 (Maler, 2010, Theorem 3)
Lemma 9
Lemma 10
...and 19 more
