Emergent Stack Representations in Modeling Counter Languages Using Transformers

Utkarsh Tiwari, Aviral Gupta, Michael Hahn

TL;DR

The study investigates how Transformer models trained on counter languages like Dyck-1 and Shuffle-k develop internal, stack-like representations during next-token prediction. It uses probing classifiers to recover per-token stack-depth signals from internal embeddings, finding strong evidence of emergent stack structures, particularly as the number of required stacks increases. The results bridge formal language theory and mechanistic interpretability, suggesting Transformers can implicitly simulate pushdown memory without explicit architectural constraints. This has implications for understanding algorithmic learning in language models and informs circuit discovery and interpretability research.

Abstract

Transformer architectures are the backbone of most modern language models, but understanding their inner workings remains largely an open problem. One line of prior research tackles this problem by isolating the learning capabilities of these architectures, training them on well-understood classes of formal languages. We extend this literature by analyzing models trained on counter languages, which can be modeled using counter variables. We train transformer models on four counter languages and equivalently formulate these languages using stacks, whose depths can be understood as the counter values. We then probe the models' internal representations for the stack depth at each input token, showing that these models, when trained as next-token predictors, learn stack-like representations. This brings us closer to understanding the algorithmic details of how transformers learn languages and helps in circuit discovery.
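To make the link between counters and stack depths concrete, the following is a minimal sketch of how per-token depth labels (the kind of target a probe could be trained on) might be computed. It assumes Shuffle-k is the interleaving of k Dyck-1 languages over disjoint bracket pairs, with one counter per pair. The function name `depth_labels` and the bracket symbols are illustrative choices, not taken from the paper.

```python
from typing import List, Sequence, Tuple


def depth_labels(
    tokens: str,
    pairs: Sequence[Tuple[str, str]] = (("(", ")"),),
) -> List[Tuple[int, ...]]:
    """Return the counter values after each token, one counter per bracket pair.

    With a single pair this is Dyck-1; with k disjoint pairs it is
    (under the assumption stated above) Shuffle-k.
    """
    opener = {o: i for i, (o, _) in enumerate(pairs)}
    closer = {c: i for i, (_, c) in enumerate(pairs)}
    counters = [0] * len(pairs)
    labels: List[Tuple[int, ...]] = []
    for tok in tokens:
        if tok in opener:
            # Push: the depth of the matching stack grows by one.
            counters[opener[tok]] += 1
        elif tok in closer:
            i = closer[tok]
            if counters[i] == 0:
                raise ValueError(f"unmatched closing token {tok!r}")
            # Pop: the depth of the matching stack shrinks by one.
            counters[i] -= 1
        else:
            raise ValueError(f"unknown token {tok!r}")
        labels.append(tuple(counters))
    return labels


if __name__ == "__main__":
    print(depth_labels("(())"))
    print(depth_labels("([)]", pairs=[("(", ")"), ("[", "]")]))
```

Because each stack only ever holds one symbol type, its depth carries the same information as the counter, which is why a depth label can stand in for the counter value here.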

Paper Structure

This paper contains 16 sections, 7 equations, 19 figures.

Figures (19)

  • Figure 1: High probing accuracy for the counter values at each token, on a model trained as a next-token predictor to recognize counter languages, shows that the model learns stack-like structures in its internal representations, which helps in reasoning about and interpreting the algorithm it has learned for this task.
  • Figure 2: The model is trained to predict the set of all valid next tokens.
  • Figure 3: Probing accuracy across model architectures for different stack depths. Blue lines show task validation accuracy, while red lines represent a randomized control baseline. High task accuracy with low control accuracy indicates successful learning of stack-like structures.
  • Figure 4: A graphical representation of a 1-counter machine that accepts Dyck-1 if we set $F$ to verify that the counter is 0 and we are in $q_1$.
  • Figure 5: Behavior of the counter machine in Figure 4 on $(())$ (top) and $(()($ (bottom).
  • ...and 14 more figures
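
The captions for Figures 2, 4, and 5 can be illustrated with a small sketch of a single-counter recognizer for Dyck-1, together with the set of valid next tokens for a given counter value. This is an illustration under stated assumptions, not the paper's implementation: it ignores the automaton's state labels, and the function names and the `<eos>` end-of-sequence symbol are hypothetical.

```python
from typing import List, Set, Tuple


def run_counter_machine(s: str) -> Tuple[List[int], bool]:
    """Simulate a 1-counter machine on s.

    Returns the counter value after each consumed token and whether
    the input is accepted (counter back at 0 at the end).
    """
    counter = 0
    trace: List[int] = []
    for ch in s:
        if ch == "(":
            counter += 1
        elif ch == ")":
            if counter == 0:
                # Closing bracket with nothing open: reject immediately.
                return trace, False
            counter -= 1
        else:
            raise ValueError(f"unknown token {ch!r}")
        trace.append(counter)
    return trace, counter == 0


def valid_next_tokens(counter: int) -> Set[str]:
    """Set of tokens that may legally follow a prefix with this counter value."""
    valid = {"("}  # opening a bracket is always allowed
    if counter > 0:
        valid.add(")")  # something is open, so it may be closed
    else:
        valid.add("<eos>")  # hypothetical end marker: the prefix is balanced
    return valid


if __name__ == "__main__":
    for word in ["(())", "(()("]:
        trace, accepted = run_counter_machine(word)
        print(word, trace, accepted, valid_next_tokens(trace[-1]))
```

On the two inputs from Figure 5, the first ends with the counter at 0 and is accepted, while the second ends with the counter at 2 and is rejected.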