Separations in the Representational Capabilities of Transformers and Recurrent Architectures

Satwik Bhattamishra, Michael Hahn, Phil Blunsom, Varun Kanade

TL;DR

The paper compares what Transformers and recurrent architectures can represent under finite-precision constraints, focusing on four tasks: index lookup, recognizing bounded Dyck languages, string equality (EQ), and nearest-neighbor search (NN). It shows size separations in both directions. A one-layer Transformer of logarithmic width can solve index lookup, whereas an RNN needs a hidden state of linear size. Conversely, constant-size RNNs can recognize the bounded Dyck language Dyck-$(2, 2)$, whereas one-layer Transformers need linear size. Two-layer Transformers of logarithmic size can implement EQ and NN, whereas recurrent models need linear size for both tasks and one-layer Transformers need linear size for EQ. The upper bounds rest on nearly orthogonal embeddings in the style of the Johnson-Lindenstrauss lemma, and the lower bounds on reductions from communication complexity problems; experiments on practical-size sequences support the theoretical distinctions. Overall, the results show that neither architecture dominates: which one is more compact depends on the task, which bears on model design and efficiency.
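To make the "constant-size recurrent model" side of the Dyck separation concrete, here is a minimal sketch of a left-to-right recognizer whose memory does not grow with the input length. It assumes that the bounded Dyck language means two bracket types with nesting depth at most 2; the function name `is_bounded_dyck`, the bracket alphabet, and the `max_depth` parameter are illustrative choices, not taken from the paper.

```python
# Assumption: bounded Dyck = two bracket types, nesting depth capped at max_depth.
PAIRS = {")": "(", "]": "["}
OPENERS = set(PAIRS.values())


def is_bounded_dyck(s: str, max_depth: int = 2) -> bool:
    """Scan s once, keeping a stack that never exceeds max_depth entries."""
    stack = []  # bounded by max_depth, so the state size is independent of len(s)
    for ch in s:
        if ch in OPENERS:
            if len(stack) == max_depth:
                return False  # nesting deeper than allowed
            stack.append(ch)
        elif ch in PAIRS:
            # A closing bracket must match the most recent unmatched opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        else:
            return False  # symbol outside the bracket alphabet
    return not stack  # every opener must be closed


examples = ["([])", "()[]", "([)]", "(([]))", "(("]
results = {e: is_bounded_dyck(e) for e in examples}
```

Because the stack holds at most `max_depth` symbols, the recognizer is a finite-state machine, which is the kind of computation a constant-size recurrent model can carry out step by step.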

Abstract

Transformer architectures have been widely adopted in foundation models. Due to their high inference costs, there is renewed interest in exploring the potential of efficient recurrent architectures (RNNs). In this paper, we analyze the differences in the representational capabilities of Transformers and RNNs across several tasks of practical relevance, including index lookup, nearest neighbor, recognizing bounded Dyck languages, and string equality. For the tasks considered, our results show separations based on the size of the model required for different architectures. For example, we show that a one-layer Transformer of logarithmic width can perform index lookup, whereas an RNN requires a hidden state of linear size. Conversely, while constant-size RNNs can recognize bounded Dyck languages, we show that one-layer Transformers require a linear size for this task. Furthermore, we show that two-layer Transformers of logarithmic size can perform decision tasks such as string equality or disjointness, whereas both one-layer Transformers and recurrent models require linear size for these tasks. We also show that a log-size two-layer Transformer can implement the nearest neighbor algorithm in its forward pass; on the other hand, recurrent models require linear size. Our constructions are based on the existence of $N$ nearly orthogonal vectors in $O(\log N)$-dimensional space, and our lower bounds are based on reductions from communication complexity problems. We supplement our theoretical results with experiments that highlight the differences in the performance of these architectures on practical-size sequences.
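The claim that $N$ nearly orthogonal vectors fit in $O(\log N)$ dimensions can be checked numerically. The sketch below draws random sign vectors, a standard way to obtain such a family, and measures the largest inner product between distinct vectors. The constant 64 in the dimension, the seed, and the helper names `random_sign_vectors` and `max_abs_offdiag` are arbitrary illustrative choices rather than values from the paper.

```python
import numpy as np


def random_sign_vectors(n: int, dim: int, seed: int = 0) -> np.ndarray:
    """Return n unit vectors in R^dim with entries +-1/sqrt(dim)."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(n, dim))
    return signs / np.sqrt(dim)


def max_abs_offdiag(vectors: np.ndarray) -> float:
    """Largest |<v_i, v_j>| over all pairs i != j."""
    gram = vectors @ vectors.T
    np.fill_diagonal(gram, 0.0)  # ignore <v_i, v_i> = 1
    return float(np.abs(gram).max())


N = 1024
dim = 64 * int(np.ceil(np.log2(N)))  # dimension grows like log N (constant 64 is arbitrary)
vecs = random_sign_vectors(N, dim)
worst = max_abs_offdiag(vecs)
print(f"N={N}, dim={dim}, max |inner product| between distinct vectors: {worst:.3f}")
```

Each vector has unit norm, while the inner product of two distinct vectors concentrates around zero with standard deviation $1/\sqrt{\text{dim}}$, so a dimension proportional to $\log N$ is enough to keep all pairwise inner products small with high probability.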

Paper Structure

This paper contains 33 sections, 27 theorems, 57 equations, 4 figures.

Key Result

Theorem 1

For all $N \in \mathbb{N}$, there is a $1$-layer Transformer with width $m= O(\log N)$ and precision $p= O(\log N)$ which performs the index lookup task for all input sequences of lengths up to $N$.
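The idea behind a result of this kind can be illustrated with a single softmax attention head that retrieves a token by position. The sketch below is not the paper's construction: it assumes the task is "given a sequence and a query position, return the token at that position", uses random sign vectors as positional codes, and runs in float64 rather than $O(\log N)$ bits of precision. The function name `index_lookup_attention` and the parameters `dim`, `scale`, and `seed` are hypothetical.

```python
import numpy as np


def index_lookup_attention(tokens, query_pos, dim=256, scale=50.0, seed=0):
    """Return tokens[query_pos] (0-based) using one softmax attention head."""
    n = len(tokens)
    vocab = sorted(set(tokens))
    rng = np.random.default_rng(seed)

    # Keys: one nearly orthogonal unit vector per position.
    pos_codes = rng.choice([-1.0, 1.0], size=(n, dim)) / np.sqrt(dim)
    # Values: one-hot encoding of the token stored at each position.
    values = np.eye(len(vocab))[[vocab.index(t) for t in tokens]]

    # Query: the positional code of the requested index.
    query = pos_codes[query_pos]
    # The matching key scores `scale`; every other key scores close to zero.
    scores = scale * (pos_codes @ query)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()

    mixed = weights @ values  # attention output, dominated by the target token
    return vocab[int(np.argmax(mixed))]


tokens = list("abcdefghij" * 10)  # toy sequence of length 100
retrieved = [index_lookup_attention(tokens, p) for p in range(len(tokens))]
```

Because the positional codes are nearly orthogonal, the softmax places almost all of its weight on the queried position, so the attention output is close to the one-hot vector of the stored token. A recurrent model reading the sequence before the query arrives has no such shortcut, since it must carry enough state to answer any possible query.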

Figures (4)

  • Figure 1: Illustration of a few key tasks considered in our work.
  • Figure 2: Performance of models on the Index Lookup and bounded Dyck tasks. Labels such as TF-(1, 64) denote Transformers with 1 layer and width 64. See Section \ref{sec:experiments} for more details.
  • Figure 3: Performance of Mamba on the Index Lookup task across various lengths and widths. See Section \ref{appsubsec:data_gen} for more details.
  • Figure 4: Performance of architectures on the Equality task. See Section \ref{appsubsec:eq_exp} for more details.

Theorems & Definitions (42)

  • Theorem 1
  • Proof sketch
  • Theorem 2
  • Proof
  • Theorem 3
  • Theorem 4
  • Lemma 1
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • ...and 32 more