Counting in Small Transformers: The Delicate Interplay between Attention and Feed-Forward Layers
Freya Behrens, Luca Biggio, Lenka Zdeborová
TL;DR
This paper investigates how simple transformer blocks implement the histogram counting task. It identifies two algorithmic strategies, relation-based counting (RC) and inventory-based counting (IC), whose feasibility depends on architectural biases and dimensions. When the embedding dimension satisfies $d\ge T$, orthogonal token embeddings enable separable representations and RC via dot-product mixing, while IC relies on a feed-forward look-up with $p\ge T$. For $d< T$, the embeddings can no longer occupy orthogonal directions and become entangled, but the discrete nature of counts preserves accuracy, aided by the noise reduction that softmax provides. The authors give explicit constructions and Welch-bound-based limits on the required dimensions, and validate the regimes with extensive experiments, including two-layer variants, showing how subtle design choices determine whether attention or the feed-forward network (FFN) handles counting. These insights advance mechanistic interpretability by linking architectural components to concrete counting algorithms, with practical implications for designing compact transformers that solve algorithmic tasks with minimal capacity.
Abstract
Next to scaling considerations, architectural design choices profoundly shape the solution space of transformers. In this work, we analyze the solutions simple transformer blocks implement when tackling the histogram task: counting items in sequences. Despite its simplicity, this task reveals a complex interplay between predictive performance, vocabulary and embedding sizes, token-mixing mechanisms, and feed-forward layer capacity. We identify two theoretical counting strategies transformers adopt, relation-based and inventory-based counting, each defining distinct learning regimes for the task. These strategies dictate how functionality is distributed between attention and feed-forward layers. We further show that adding softmax and beginning-of-sequence tokens allows for more robustness when embedding dimensions are comparatively small. Empirical introspection of trained models closely confirms both the learning regimes of the various architectures and the formation of these strategies during training. We demonstrate how a basic task that requires only aggregation and selection is significantly impacted by minor design changes.
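
To make the histogram task concrete, the following is a minimal sketch rather than the authors' construction. It assumes the task labels each position with the number of times that position's token occurs in the sequence. It also illustrates the idea behind relation-based counting under a simplifying assumption: token embeddings are one-hot (the simplest orthogonal choice, with embedding dimension equal to the vocabulary size) and the mixing step is an unnormalized sum of dot products instead of softmax attention. The function names `histogram_targets`, `one_hot`, and `relation_based_counts` are hypothetical.

```python
from collections import Counter


def histogram_targets(seq):
    """Assumed task definition: for each position, the number of times
    the token at that position occurs anywhere in the sequence."""
    counts = Counter(seq)
    return [counts[tok] for tok in seq]


def one_hot(token, vocab_size):
    """One-hot embedding: an illustrative orthogonal embedding (assumption)."""
    return [1.0 if i == token else 0.0 for i in range(vocab_size)]


def relation_based_counts(seq, vocab_size):
    """Count by comparing each token with every token in the sequence.

    With orthogonal embeddings, the dot product is 1 for identical tokens
    and 0 otherwise, so summing the scores over positions gives the count.
    This uses an unnormalized sum, not softmax attention (assumption).
    """
    embeddings = [one_hot(tok, vocab_size) for tok in seq]
    result = []
    for query in embeddings:
        scores = [sum(q * k for q, k in zip(query, key)) for key in embeddings]
        result.append(round(sum(scores)))
    return result


if __name__ == "__main__":
    example = [2, 0, 2, 1, 2]
    print(histogram_targets(example))
    print(relation_based_counts(example, vocab_size=3))
```

Both functions return the same labels for the example sequence. When the embedding dimension is smaller than the vocabulary size, distinct tokens can no longer be mutually orthogonal, so the dot products between different tokens are generally nonzero and the sum is only an approximation of the count.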
