On the Ability and Limitations of Transformers to Recognize Formal Languages

Satwik Bhattamishra, Kabir Ahuja, Navin Goyal

TL;DR

Transformers can express and learn certain counter languages, such as Shuffle-Dyck and n-ary Boolean expressions, by using self-attention to emulate counter operations. The paper gives explicit constructions showing how attention patterns implement counting, and finds that Transformers generalize well on these languages but fail on broader regular languages that require periodicity or modular counting. Positional encodings and depth (number of layers) critically influence learning and generalization: single-layer models sometimes perform well but fail on resets, parity, and higher dot-depth star-free languages. The results highlight fundamental differences between Transformers and LSTMs in modeling formal languages and suggest directions for encoding schemes and architectural changes to bridge the gap.

Abstract

Transformers have supplanted recurrent models in a large number of NLP tasks. However, the differences in their abilities to model different syntactic properties remain largely unknown. Past works suggest that LSTMs generalize very well on regular languages and have close connections with counter languages. In this work, we systematically study the ability of Transformers to model such languages as well as the role of their individual components in doing so. We first provide a construction of Transformers for a subclass of counter languages, including well-studied languages such as n-ary Boolean Expressions, Dyck-1, and its generalizations. In experiments, we find that Transformers do well on this subclass, and their learned mechanism strongly correlates with our construction. Perhaps surprisingly, in contrast to LSTMs, Transformers do well only on a subset of regular languages, with degrading performance as we make languages more complex according to a well-known measure of complexity. Our analysis also provides insights on the role of the self-attention mechanism in modeling certain behaviors and the influence of positional encoding schemes on the learning and generalization abilities of the model.

Paper Structure

This paper contains 23 sections, 5 theorems, 9 equations, 6 figures, 9 tables.

Key Result

Proposition 4.1

There exists a Transformer, as defined in the section labeled sec:def, that can recognize the family of languages Shuffle-Dyck.
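
To make the target of the proposition concrete, here is a minimal sketch of what membership in a Shuffle-Dyck language means when viewed as a counter computation: one counter per bracket type, incremented on an opening bracket, decremented on the matching closing bracket, never allowed to go negative, and required to be zero at the end. This is not the paper's Transformer construction. The function name `is_shuffle_dyck` and the bracket inventory `PAIRS` (round and square brackets, as in Shuffle-2) are illustrative assumptions.

```python
from typing import Dict

# Assumed bracket inventory for Shuffle-2: round and square brackets.
PAIRS: Dict[str, str] = {"(": ")", "[": "]"}


def is_shuffle_dyck(word: str, pairs: Dict[str, str] = PAIRS) -> bool:
    """Check membership using one independent counter per bracket type."""
    closers = {close: open_ for open_, close in pairs.items()}
    counters = {open_: 0 for open_ in pairs}
    for symbol in word:
        if symbol in pairs:
            # Opening bracket: increment the counter for its type.
            counters[symbol] += 1
        elif symbol in closers:
            # Closing bracket: decrement, rejecting if nothing is open.
            key = closers[symbol]
            counters[key] -= 1
            if counters[key] < 0:
                return False
        else:
            # Symbol outside the assumed alphabet.
            return False
    # Accept only if every bracket type is fully closed.
    return all(count == 0 for count in counters.values())
```

Because the counters are independent, a string such as `([)]` is accepted here even though it is not well nested across bracket types. That independence is what separates Shuffle-Dyck from Dyck languages with several bracket types.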

Figures (6)

  • Figure 1: Counter languages form a strict superset of regular languages, and are a strict subset of context-sensitive languages. Counter and context-free languages have a nonempty intersection and neither set is contained in the other.
  • Figure 2: Values of different coordinates of the output of the self-attention block of the models trained on Shuffle-2 and BoolExp-$3$. The dotted lines are the scaled depth-to-length (DL) ratios for Shuffle-2 and the scaled counter-value-to-length (CL) ratios for BoolExp-$3$. We observe a near-perfect Pearson correlation coefficient of 0.99 between the outputs of the self-attention block and the DL and CL ratios.
  • Figure 3: Plot of value vectors of Transformer-based models trained on the Shuffle-2 and Boolean-3 languages. The Shuffle-2 model had a hidden size of 8 and the Boolean-3 model had a hidden size of 3. The x-axis corresponds to different components of the value vectors for both models. The Shuffle-2 language consisted of square and round brackets, while for Boolean-3 we considered three operators: $\sim$, a unary operator; $+$, a binary operator; and $>$, a ternary operator.
  • Figure 4: Attention maps for models trained on Shuffle-2 and Boolean-3 languages. Similar to our constructions for recognizing these languages, we observe nearly uniform attention weights in both cases.
  • Figure 5: Values of four different coordinates of the output of the self-attention block. The model is trained to recognize Shuffle-4. The dotted lines are the scaled depth-to-length ratios for the four types of brackets, provided for reference.
  • ...and 1 more figure
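
The captions above tie nearly uniform attention weights to outputs that track depth-to-length ratios. A minimal sketch of why these go together, assuming a single bracket type, causal attention with equal weight on every prefix position, and scalar value vectors of +1 for an opening bracket and -1 for a closing bracket: averaging those values over a prefix of length t gives depth divided by t. The function name and the value assignment are illustrative assumptions, not the trained models' parameters.

```python
from typing import List


def uniform_attention_depth_ratio(
    word: str, open_sym: str = "(", close_sym: str = ")"
) -> List[float]:
    """Output of uniform causal attention over assumed +1/-1 scalar values."""
    # Assumed value assignment: +1 for open, -1 for close, 0 otherwise.
    values = [
        1.0 if s == open_sym else -1.0 if s == close_sym else 0.0
        for s in word
    ]
    outputs = []
    running = 0.0
    for t, value in enumerate(values, start=1):
        running += value
        # Equal weights 1/t over the first t positions give the prefix mean,
        # which equals (current depth) / (prefix length).
        outputs.append(running / t)
    return outputs
```

For the input `(())` the running depths are 1, 2, 1, 0, so the outputs are 1/1, 2/2, 1/3, and 0/4. An output of zero at the final position corresponds to a balanced prefix for that bracket type.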

Theorems & Definitions (12)

  • Proposition 4.1
  • proof
  • Definition B.1: General counter machine (Fischer et al., 1968)
  • Definition C.1: Simplified and Stateless counter machine
  • Lemma C.1
  • proof
  • Lemma C.2
  • proof
  • Lemma C.3
  • proof
  • ...and 2 more