Contextual Counting: A Mechanistic Study of Transformers on a Quantitative Task

Siavash Golkar, Alberto Bietti, Mariel Pettee, Michael Eickenberg, Miles Cranmer, Keiya Hirashima, Geraud Krawezik, Nicholas Lourie, Michael McCabe, Rudy Morel, Ruben Ohana, Liam Holden Parker, Bruno Régaldo-Saint Blancard, Kyunghyun Cho, Shirley Ho

TL;DR

The paper introduces contextual counting as a quantitative mechanistic probe to study Transformers, formalizing a region-based counting task and examining causal versus non-causal architectures across several positional encodings. It provides theoretical insights showing that a two-layer causal Transformer with NoPE can solve arbitrary sequence lengths and region counts, and it presents detailed empirical analyses of how different position codes affect performance and generalization. The findings show that causal models outperform non-causal ones, with NoPE achieving the best accuracy but higher training variance, while RoPE remains competitive; generalization to out-of-distribution settings depends on which tokens serve as bias terms. Additionally, the work analyzes learned circuits, revealing how encoders tag regional context and how decoders attend to region-specific cues, and it includes preliminary results indicating chain-of-thought prompting can help large language models tackle this task. Overall, the study provides mechanistic explanations for how Transformers handle regional counting and offers guidance for designing quantitative reasoning models with robust generalization capabilities.

Abstract

Transformers have revolutionized machine learning across diverse domains, yet understanding their behavior remains crucial, particularly in high-stakes applications. This paper introduces the contextual counting task, a novel toy problem aimed at enhancing our understanding of Transformers in quantitative and scientific contexts. This task requires precise localization and computation within datasets, akin to object detection or region-based scientific analysis. We present theoretical and empirical analysis using both causal and non-causal Transformer architectures, investigating the influence of various positional encodings on performance and interpretability. In particular, we find that causal attention is much better suited for the task, and that using no positional embeddings leads to the best accuracy, though rotary embeddings are competitive and easier to train. We also show that out-of-distribution performance is tightly linked to which tokens the model uses as a bias term.
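To make the task concrete, here is a minimal sketch of what one sample and its target might look like. The token vocabulary ("0", "1", and the delimiters "[" and "]"), the sampling scheme, and the function names are assumptions made for illustration; this page only states that the model predicts the number of ones in each region.

```python
import random


def make_example(seq_len=32, n_regions=2, p_one=0.3, seed=0):
    """Build one toy sample: a 0/1 sequence containing n_regions bracketed spans.

    The "[" / "]" delimiter tokens are an assumed encoding, not taken from the source.
    """
    rng = random.Random(seed)
    # Pick 2 * n_regions distinct positions; consecutive pairs open and close a region.
    cuts = sorted(rng.sample(range(seq_len), 2 * n_regions))
    tokens = ["1" if rng.random() < p_one else "0" for _ in range(seq_len)]
    for k, pos in enumerate(cuts):
        tokens[pos] = "[" if k % 2 == 0 else "]"
    return tokens


def count_ones_per_region(tokens):
    """Ground-truth target: the number of "1" tokens inside each delimited region."""
    counts, inside, current = [], False, 0
    for tok in tokens:
        if tok == "[":
            inside, current = True, 0
        elif tok == "]":
            counts.append(current)
            inside = False
        elif inside and tok == "1":
            current += 1
    return counts


if __name__ == "__main__":
    sample = make_example(seq_len=512, n_regions=4)
    print(count_ones_per_region(sample))  # one count per region
```

The setting in Figure 2 (sequence length 512, 4 regions) corresponds to `make_example(seq_len=512, n_regions=4)` under these assumptions.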

Paper Structure

This paper contains 38 sections, 4 theorems, 4 equations, 25 figures, 2 tables.

Key Result

Proposition 2.1

(informal) If the regional contextual position information is linearly decodable from the latent representation of the tokens at some layer of a Transformer, the Contextual Counting task can be solved with a single additional layer.
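One way to see why a single extra layer could suffice is the following sketch. It is an illustration under stated assumptions, not the paper's proof: it assumes each token already carries a decoded region tag, and that a hypothetical bias token ("BOS", placed at index 0) is available. If one attention row puts equal weight on the bias token and on every 1-token in the target region, the weight on the bias token is 1 / (1 + n), which determines the count n.

```python
import numpy as np


def count_via_attention(tokens, region_ids, target_region, big=30.0):
    """Recover the number of 1s in target_region from one softmax attention row.

    tokens:     sequence of "BOS", "0", "1"; "BOS" at index 0 is an assumed bias token
    region_ids: per-token region tag (-1 outside any region), assumed already decoded
    big:        logit penalty that effectively masks out all other tokens
    """
    tokens = np.array(tokens)
    region_ids = np.array(region_ids)
    # Logit 0 for the bias token and for 1-tokens in the target region, -big elsewhere.
    attend = (tokens == "BOS") | ((tokens == "1") & (region_ids == target_region))
    logits = np.where(attend, 0.0, -big)
    # Numerically stable softmax over the sequence.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # With n attended 1-tokens the bias weight is about 1 / (1 + n), so n = 1 / w - 1.
    bos_weight = weights[0]
    return int(round(1.0 / bos_weight - 1.0))


if __name__ == "__main__":
    toks = ["BOS", "0", "1", "1", "0", "1", "0", "0"]
    tags = [-1, 0, 0, 0, -1, 1, 1, 1]
    print([count_via_attention(toks, tags, r) for r in (0, 1)])
```

In a trained model the final inversion would have to be learned (for example by an MLP) rather than computed in closed form. The sketch also hints at why the choice of bias token matters: the recovered count depends entirely on how much attention that token absorbs.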

Figures (25)

  • Figure 1: Typical output of a trained model on the Contextual Counting task. The model outputs a probability distribution over the number of ones in the relevant region. In this case, the predictions for regions 0, 2, and 3 are correct (the true counts are given by the dashed line), but the model has failed to learn region 1.
  • Figure 2: Model accuracy on the Contextual Counting task. Here the input sequence has length 512 with 4 regions. The results in the shaded region denote models trained with non-causal attention. RoPE and NoPE outperform absolute positional encodings, and non-causal models fail to learn. For ALiBi, see the corresponding appendix section.
  • Figure 3: Generalization performance on test samples with shorter sequences (T=300). Of the models that perform well in-distribution, only a few generalize to shorter sequence lengths.
  • Figure 4: Generalization performance on test samples with three regions. In this case RoPE generalizes much better than NoPE.
  • Figure 5: Predictions of three different solutions in the original distribution as well as on shorter sequences and fewer regions. The various solution types suffer from different failure modes when evaluated on out-of-distribution samples. The difference in behavior can be traced to the use of different tokens as biasing terms.
  • ...and 20 more figures

Theorems & Definitions (6)

  • Proposition 2.1
  • Proposition 2.2
  • Proposition 2.3
  • Proposition 2.4
  • Remark 2.5
  • proof