Contextual Counting: A Mechanistic Study of Transformers on a Quantitative Task
Siavash Golkar, Alberto Bietti, Mariel Pettee, Michael Eickenberg, Miles Cranmer, Keiya Hirashima, Geraud Krawezik, Nicholas Lourie, Michael McCabe, Rudy Morel, Ruben Ohana, Liam Holden Parker, Bruno Régaldo-Saint Blancard, Kyunghyun Cho, Shirley Ho
TL;DR
The paper introduces contextual counting as a quantitative mechanistic probe for studying Transformers, formalizing a region-based counting task and comparing causal and non-causal architectures across several positional encodings. On the theoretical side, it shows that a two-layer causal Transformer with no positional encoding (NoPE) can solve the task for arbitrary sequence lengths and numbers of regions. On the empirical side, it analyzes in detail how different positional encodings affect performance and generalization. Causal models outperform non-causal ones; NoPE achieves the best accuracy but with higher training variance, while rotary positional embeddings (RoPE) remain competitive. Generalization to out-of-distribution settings depends on which tokens the model uses as bias terms. The work also analyzes the learned circuits, revealing how encoders tag regional context and how decoders attend to region-specific cues, and it includes preliminary results indicating that chain-of-thought prompting can help large language models tackle this task. Overall, the study offers mechanistic explanations for how Transformers handle regional counting, along with guidance for designing quantitative reasoning models that generalize robustly.
Abstract
Transformers have revolutionized machine learning across diverse domains, yet understanding their behavior remains crucial, particularly in high-stakes applications. This paper introduces the contextual counting task, a novel toy problem aimed at enhancing our understanding of Transformers in quantitative and scientific contexts. This task requires precise localization and computation within datasets, akin to object detection or region-based scientific analysis. We present theoretical and empirical analyses using both causal and non-causal Transformer architectures, investigating the influence of various positional encodings on performance and interpretability. In particular, we find that causal attention is much better suited for the task, and that using no positional embeddings leads to the best accuracy, though rotary embeddings are competitive and easier to train. We also show that out-of-distribution performance is tightly linked to which tokens the model uses as a bias term.
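To make the task concrete, here is a minimal Python sketch of the input-to-target mapping. The text above does not spell out the vocabulary, so the token set (`0`, `1`, `[`, `]`), the assumption that regions do not nest, and the function name `count_ones_per_region` are all illustrative assumptions. This is a plain reference computation of the labels, not the Transformer model studied in the paper.

```python
from typing import List


def count_ones_per_region(tokens: List[str]) -> List[int]:
    """Return the number of '1' tokens inside each bracket-delimited region.

    Assumed format (not stated in the abstract): tokens are drawn from
    {'0', '1', '[', ']'} and regions are well formed and non-nested.
    """
    counts: List[int] = []
    inside = False
    for tok in tokens:
        if tok == "[":
            # A new region opens: start a fresh counter for it.
            inside = True
            counts.append(0)
        elif tok == "]":
            # The current region closes.
            inside = False
        elif tok == "1" and inside:
            # Only ones that fall inside a region are counted.
            counts[-1] += 1
    return counts


# Hypothetical example sequence with three regions.
example = list("0110[101]0[1]1[]0")
print(count_ones_per_region(example))  # [2, 1, 0]
```

Ones that fall outside every region are ignored, which is why the task requires localization as well as counting.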
