
Counting in Small Transformers: The Delicate Interplay between Attention and Feed-Forward Layers

Freya Behrens, Luca Biggio, Lenka Zdeborová

TL;DR

Counting in Small Transformers investigates how simple transformer blocks implement the histogram counting task, revealing two algorithmic strategies—relation-based counting (RC) and inventory-based counting (IC)—whose feasibility depends on architectural biases and dimensions. The study shows that when embedding dimension satisfies $d\ge T$, orthogonal token embeddings enable separable representations and RC via dot-product mixing, while IC relies on a feed-forward look-up with $p\ge T$; for $d< T$, embeddings entangle directions but the discrete nature of counts preserves accuracy, aided by softmax noise reduction. The authors provide explicit constructions and Welch bound-based limits on the required dimensions, and validate the regimes with extensive experiments, including two-layer variants, showing how subtle design choices steer whether attention or FFN handles counting. These insights advance mechanistic interpretability by linking architectural components to concrete counting algorithms and reveal practical implications for designing compact transformers that solve algorithmic tasks with minimal capacity.

Abstract

Next to scaling considerations, architectural design choices profoundly shape the solution space of transformers. In this work, we analyze the solutions simple transformer blocks implement when tackling the histogram task: counting items in sequences. Despite its simplicity, this task reveals a complex interplay between predictive performance, vocabulary and embedding sizes, token-mixing mechanisms, and feed-forward layer capacity. We identify two theoretical counting strategies transformers adopt, relation-based and inventory-based counting, each defining distinct learning regimes for the task. These strategies dictate how functionality is distributed between attention and feed-forward layers. We further show that adding softmax and beginning-of-sequence tokens allows for more robustness when embedding dimensions are comparatively small. Empirical introspection of trained models closely confirms both the learning regimes of the various architectures and the formation of these strategies during training. We demonstrate how a basic task that requires only aggregation and selection is significantly impacted by minor design changes.
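To make the histogram task concrete, the following minimal sketch computes the target labels for one sequence, assuming the task asks, at every position, for the number of times the token at that position occurs in the whole sequence. The function name `histogram_targets` is hypothetical and is used only for illustration.

```python
from collections import Counter


def histogram_targets(seq):
    """Return, for each position, how often that position's token occurs in seq.

    Assumption: this is the per-position labelling used for the histogram task.
    """
    counts = Counter(seq)                  # token -> number of occurrences
    return [counts[tok] for tok in seq]    # look the count up at every position


if __name__ == "__main__":
    print(histogram_targets(list("ABAAC")))
```

For the sequence `A B A A C`, the token `A` occurs three times and `B` and `C` once each, so every `A` position is labelled 3 and the other two positions are labelled 1.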

Paper Structure (55 sections, 6 theorems, 63 equations, 21 figures)


Key Result

Proposition 4.1

For each of bos and bos+sftm and a given $L\geq2$, there exists a configuration of weights that solves the histogram task at 100% accuracy, provided that $d \geq T > 2$ and $p = 1$.
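The idea behind relation-based counting with a BOS token can be illustrated numerically. The sketch below is not the construction from the paper's proof; it is a hand-built example under stated assumptions: one-hot token embeddings with $d=T$, a BOS embedding equal to the all-ones vector (so that it overlaps equally with every token), identity query and key maps scaled by a hypothetical temperature `beta`, and a direct read-out of the count from the BOS attention weight instead of a $p=1$ feed-forward layer. With these choices, a token that occurs $c$ times spreads its attention almost evenly over its $c$ copies and the BOS position, so the BOS weight is close to $1/(c+1)$.

```python
import numpy as np


def rc_counts_with_bos(seq, T, beta=30.0):
    """Illustrative relation-based counting with softmax attention and a BOS token.

    Assumptions (not taken from the paper): one-hot embeddings, all-ones BOS
    embedding, identity query/key maps scaled by `beta`, count decoded directly
    from the attention weight on the BOS position.
    """
    E = np.eye(T)                       # orthogonal token embeddings, d = T
    e_bos = np.ones(T)                  # BOS has overlap 1 with every token
    X = np.vstack([e_bos, E[seq]])      # shape (L + 1, T), BOS at position 0

    # Scores of every real token against all positions (including BOS):
    # beta for identical tokens and for BOS, 0 for different tokens.
    scores = beta * (X[1:] @ X.T)       # shape (L, L + 1)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)   # row-wise softmax

    a_bos = A[:, 0]                     # attention weight on BOS, about 1 / (c + 1)
    return np.rint(1.0 / a_bos - 1.0).astype(int)


if __name__ == "__main__":
    print(rc_counts_with_bos([3, 1, 3, 3, 0], T=5))
```

The larger `beta` is, the less attention leaks to tokens of a different type, and the closer the BOS weight is to the exact value $1/(c+1)$.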

Figures (21)

  • Figure 1: Accuracy on the histogram task for different 1-layer architectures. Mean accuracy for varying embedding size $d$ and hidden layer size $p$, for fixed $T=32$ and $L=10$, for the different token mixing mechanisms dot, bos and lin. (Left) Models with softmax; (Right) Models without softmax. Average over $5$ runs for every $d, p\in\{1,2,3,4,6,8,12,16,23,32,45,64,91,128\}$. Vertical and horizontal white lines indicate $p=T$ and $d=T$ respectively. White stars (dots) mark configurations for which $100\%$ ($>99\%$) accuracy was reached in at least one of the five runs.
  • Figure 2: Test accuracy vs. total number of learned parameters. The data is the same as generated for Fig. 1; every data point is a single experiment, and we show the convex hull in solid lines.
  • Figure 3: Relation-based counting with bos+sftm ($T=32,L=10,p=2,d=45$). This model achieves $99.9\%$ accuracy. It was selected as the best model from all our experiments with $p=2$. (Left) The overlaps (cosine similarity) of tokens with the same tokens, with different tokens and with the BOS token each concentrate around a different value. (Middle) This is reflected in the attention matrix after the application of the row-wise softmax. The $t_{\mathrm{BOS}}$ ('$') in the first column $a_{\ell,0}$ becomes a proxy for the count of $x_\ell$. (Right) To demonstrate that the feed-forward network is only sensitive to this direction, we show its count predictions for a mix of tokens, fixing $\bar{x}'=\alpha e_{\mathrm{BOS}} + (1-\alpha) e_D+ e_B$. The contribution $\alpha$ of the BOS token to the intermediate $\bar{x}'$ is varied, and $D,B$ are two specific elements of the alphabet $\mathcal{T}$, represented by their embeddings $e_D$ and $e_B$. The $y$-axis shows the predicted class of the feed-forward layer $f(\bar{x}')$ for a given $\alpha$. We measure $a_{\ell,0}$, the actual contribution of the BOS token for the sequence in the middle plot, for the letters that occur $1, 3$ and $5$ times. The same experiment is repeated for different elements of the alphabet in App. \ref{app::sec:BOSmixing}, showing that the count prediction is independent of the other tokens present in the sequence.
  • Figure 4: Inventory-based counting with dot+sftm ($p=32,d=32$) and lin+sftm ($p=64,d=64$). We have $T=32$, $L=10$. The models achieve 99.47% and 100% accuracy respectively. (Left columns) The attention matrix for two different sequences. It differentiates between same (red squares) and different tokens for dot+sftm but is invariant to the semantics for lin. For dot+sftm, any counting direction that could emerge in token space is not informative due to the softmax normalization, so $p\geq T$ is required (see Fig. 1). (Right columns) We test the feed-forward layer $f$ in isolation by feeding it an artificial mix of learned embeddings for the three tokens $B,C$ and $D$. We plot the class predicted by $f$ for inputs constructed as $\bar{x}' = \alpha_B e_B + \alpha_C e_C + \alpha_D e_D + R$, where $1=\alpha_B +\alpha_C + \alpha_D$ and we change the residual $R \in \{e_B,e_C,e_D\}$ from left to right, for each of the models dot+sftm and lin+sftm. The prediction strongly depends on the coefficient $\alpha_t$ associated with the token $t$ present in the residual connection and only weakly on the others. The non-linear scaling of the decision boundaries is due to the softmax activation function.
  • Figure 5: Introspecting the regime with entangled embeddings with dot ($T=32,L=10,p=32$). We show trained instances of dot for varying model dimension $d$. (Top) The confusion matrix of ground truth and predicted counts. (Bottom) The overlap distribution between same and different token embeddings.
  • ...and 16 more figures
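
The inventory-based strategy probed in Figure 4 can be sketched in a few lines. This is an illustrative toy construction, not the trained model or the paper's proof: it assumes one-hot embeddings with $d=T$, a mixing step that simply averages all token embeddings (standing in for a semantics-invariant mixing such as lin), a residual connection that adds the current token's embedding, and a hand-set ReLU layer with $p=T$ hidden units. The mixed vector then holds the full inventory of counts, and the hidden unit matching the residual token is the only one that activates.

```python
import numpy as np


def ic_counts(seq, T):
    """Illustrative inventory-based counting with a T-unit ReLU look-up.

    Assumptions (not taken from the paper): one-hot embeddings, uniform
    averaging as the mixing step, hand-set feed-forward weights.
    """
    L = len(seq)
    X = np.eye(T)[seq]                   # (L, T) one-hot token embeddings
    inventory = X.mean(axis=0)           # coordinate t holds count_t / L
    mixed = inventory + X                # residual adds 1 at the current token

    # Hidden unit t computes ReLU(L * (mixed_t - 1)):
    #   t == current token -> L * (count_t / L) = count_t
    #   t != current token -> L * (count_t / L - 1) < 0, clipped to 0
    hidden = np.maximum(0.0, L * (mixed - 1.0))
    return np.rint(hidden.sum(axis=1)).astype(int)


if __name__ == "__main__":
    print(ic_counts([3, 1, 3, 3, 0], T=5))
```

The look-up needs one hidden unit per token type in this sketch, which is one way to see why a condition such as $p\geq T$ appears for this strategy.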

Theorems & Definitions (18)

  • Proposition 4.1: RC with BOS token
  • Proposition 4.2: RC with tagged embeddings
  • Proposition 4.3: IC with memorization in the feed-forward layer
  • Proposition 4.4: Robustness via bounded mutual coherence
  • Proposition 4.5: Robustness via softmax error-reduction
  • proof : Proof of Proposition 4.2
  • proof : Proof of Proposition 4.1 for bos+sftm
  • proof : Proof of Proposition 4.1 - bos
  • proof : Alternative Proof of Proposition 4.1 - bos
  • proof : Proof of Proposition 4.3 - lin
  • ...and 8 more
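
Proposition 4.4 refers to mutual coherence, and the TL;DR mentions Welch-bound limits on the required dimensions. As a small numerical illustration, the sketch below compares the mutual coherence (largest absolute cosine similarity between two distinct embeddings) of random Gaussian embeddings with the standard Welch lower bound $\sqrt{(T-d)/(d(T-1))}$, which holds for any $T>d$ unit vectors in $\mathbb{R}^d$. The random embeddings and the helper names `welch_bound` and `mutual_coherence` are assumptions made for illustration and do not reproduce the paper's experiments.

```python
import numpy as np


def welch_bound(T, d):
    """Welch lower bound on the mutual coherence of T unit vectors in R^d."""
    if T <= d:
        return 0.0                       # orthogonal embeddings are possible
    return float(np.sqrt((T - d) / (d * (T - 1))))


def mutual_coherence(E):
    """Largest absolute cosine similarity between two distinct rows of E."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)   # normalise rows
    G = np.abs(E @ E.T)                                # absolute Gram matrix
    np.fill_diagonal(G, 0.0)                           # ignore self-overlaps
    return float(G.max())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = 32
    for d in (8, 16, 32):
        E = rng.standard_normal((T, d))  # hypothetical random embeddings
        print(d, welch_bound(T, d), mutual_coherence(E))
```

For $d<T$ the bound is strictly positive, so some overlap between distinct token embeddings is unavoidable; this is the entangled regime that the robustness propositions address.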