Do Attention Heads Compete or Cooperate during Counting?
Pál Zsámboki, Ádám Fraknói, Máté Gedeon, András Kornai, Zsolt Zombori
TL;DR
This work uses mechanistic interpretability to analyze how a small, single-layer transformer solves a counting task. It shows that attention heads behave as a pseudo-ensemble for the counting semantics, but the output layer must aggregate their representations non-uniformly to satisfy the syntactic EOS constraint. Through three metrics (l-acc, ROC AUC, s-acc), attention patterns, and attention-intervention experiments, the authors demonstrate that the key factor is how heads attend to tokens (notably balancing $w_{01}$ and suppressing $w_{02}$) and that model performance largely aligns with a linear combination of head outputs. The findings illuminate the interplay between semantic and syntactic requirements in transformers and highlight the value of mechanistic interpretability for understanding head-level contributions in even simple algorithmic tasks.
Abstract
We present an in-depth mechanistic interpretability analysis of training small transformers on an elementary task, counting, which is a crucial deductive step in many algorithms. In particular, we investigate collaboration and competition among the attention heads: we ask whether the heads behave as a pseudo-ensemble, all solving the same subtask, or whether they perform different subtasks, so that they can solve the original task only in conjunction. We present evidence that, on the semantics of the counting task, attention heads behave as a pseudo-ensemble, but their outputs need to be aggregated non-uniformly to create an encoding that conforms to the syntax. Our source code will be available upon publication.
