What Formal Languages Can Transformers Express? A Survey

Lena Strobl, William Merrill, Gail Weiss, David Chiang, Dana Angluin

TL;DR

This survey investigates the theoretical capabilities of transformers when inputs and outputs are treated as formal languages. By organizing results around automata, circuit, and logic frameworks, it clarifies how architectural choices (PEs, masking, norm placement), numerical precision, and the presence of intermediate steps determine expressivity, from AC^0-like encoders through TC^0 upper bounds to Turing-complete decoders. It highlights three regimes: encoder-only variants with hard attention, which are capped at AC^0; average-hard and softmax attention variants, which stay within TC^0; and decoders with intermediate steps, which achieve Turing-completeness. The work also connects practical questions (chain-of-thought, BOS/CLS tokens) to formal models, providing a roadmap for more precise characterizations and a guide to design choices. Overall, it offers a unified lens for comparing transformer variants and their limits against classical models of computation.

Abstract

As transformers have gained prominence in natural language processing, some researchers have investigated theoretically what problems they can and cannot solve, by treating problems as formal languages. Exploring such questions can help clarify the power of transformers relative to other models of computation, their fundamental capabilities and limits, and the impact of architectural choices. Work in this subarea has made considerable progress in recent years. Here, we undertake a comprehensive survey of this work, documenting the diverse assumptions that underlie different results and providing a unified framework for harmonizing seemingly contradictory findings.

Paper Structure

This paper contains 52 sections, 23 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: Relationship of some languages and language classes discussed in this paper (right) to the Chomsky hierarchy (left), assuming that $\mathsf{TC}^0 \subsetneq \mathsf{NC}^1$ and $\mathsf{L} \subsetneq \mathsf{NL}$. Circuit classes are $\mathsf{DLOGTIME}$-uniform.
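
For orientation, the classes named in the caption and in the TL;DR sit in the following standard chain of inclusions. This is a sketch of textbook complexity-theory background, assuming $\mathsf{DLOGTIME}$-uniform circuit classes as in the caption; it is not a formula taken from the paper itself.

$$\mathsf{AC}^0 \subsetneq \mathsf{TC}^0 \subseteq \mathsf{NC}^1 \subseteq \mathsf{L} \subseteq \mathsf{NL}$$

Only the first inclusion is known to be strict. The caption's two assumptions, $\mathsf{TC}^0 \subsetneq \mathsf{NC}^1$ and $\mathsf{L} \subsetneq \mathsf{NL}$, would make the second and fourth inclusions strict as well.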