Carrying over algorithm in transformers

Jorrit Kruthoff

TL;DR

The paper studies how transformer models learn and implement the carrying over algorithm for addition, using small encoder-only and decoder-only models on a 3-digit addition task. It finds a modular implementation: layer 0 adds the digits at each position, the attention in layer 1 determines which positions need a carried one, and the final MLP performs the carry. Neuron-level analysis (SVD) and ablations support this picture. The authors demonstrate length generalisation through priming and finetuning, extending the analysis to 4- and 6-digit addition, and give suggestive evidence of similar modular patterns in three 7B LLMs (Alpaca, Llemma, Zephyr). The work contributes a mechanistic, interpretable account of arithmetic in transformers, which may inform efforts to improve mathematical reasoning and generalisation in large models.

Abstract

Addition is perhaps one of the simplest arithmetic tasks one can think of and is usually performed using the carrying over algorithm. This algorithm consists of two tasks: adding digits in the same position and carrying over a one whenever necessary. We study how transformer models implement this algorithm and how the two aforementioned tasks are allocated to different parts of the network. We first focus on two-layer encoder-only models and show that the carrying over algorithm is implemented in a modular fashion. The first layer is mostly responsible for adding digits in the same position. The second layer first decides, in the attention, which positions need a carried one or not, and then performs the carrying of the one in the final MLP. We provide a simple way of precisely identifying which neurons are responsible for that task. This implementation of the carrying over algorithm occurs across a range of hyperparameters for two as well as three-layer models. For small decoder-only models, we observe the same implementation and provide suggestive evidence for its existence in three 7B large language models.
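The two tasks named in the abstract can be written out in plain Python as a minimal sketch. This only illustrates the arithmetic; it is not the computation the networks perform, and the function names, the most-significant-digit-first ordering, and the separately returned overflow carry are assumptions made here for illustration.

```python
def digitwise_sums(a, b):
    """Task 1: add digits in the same position, ignoring any carry."""
    return [x + y for x, y in zip(a, b)]


def carry_flags(sums):
    """Task 2a: decide which positions receive a carried one.

    Digits are ordered most-significant first, so we scan right to left.
    A position passes a one to its left neighbour when its own sum plus
    any incoming carry reaches 10 (a sum of 9 only does so if it
    itself receives a carry).
    """
    flags = [0] * len(sums)
    carry = 0
    for i in range(len(sums) - 1, -1, -1):
        flags[i] = carry  # carried one arriving at position i
        carry = 1 if sums[i] + carry >= 10 else 0
    return flags, carry  # the last carry overflows past the leading digit


def add_digits(a, b):
    """Task 2b: apply the carried ones and reduce each position mod 10."""
    sums = digitwise_sums(a, b)
    flags, overflow = carry_flags(sums)
    digits = [(s + f) % 10 for s, f in zip(sums, flags)]
    return digits, overflow


if __name__ == "__main__":
    # 158 + 247 = 405: the middle position sums to 9 and needs a carried one.
    print(add_digits([1, 5, 8], [2, 4, 7]))
```

Keeping the per-position sums, the carry decision, and the carry application in separate functions mirrors the split the abstract describes between the first layer, the second-layer attention, and the final MLP.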

Paper Structure (56 sections, 26 figures, 5 tables)

Introduction
Our contributions
Related works
Set-up
Dataset
Models & Training
Methodology
One layer
Phase transitions in the Attention
MLP
Two layers
Attention
MLP
A journey of hidden representations
Layer 0: Determining whether sum $< 10$ or $\geq 10$.
...and 41 more sections

Figures (26)

Figure 1: Summary of two-layer models' implementation of the carrying over algorithm. Note that when we write the addition of two vectors, we mean a linear combination, but for clarity we did not write the coefficients. The light blue indicates $\geq 10$ and darker $<10$. Similarly, the light orange indicates that a carried one needs to be added, whereas for the dark orange it is not necessary.
Figure 2: Left: Loss and norm of the weights as a function of epochs for both the training and test data. Train/test split is $s = 0.3$ and $\lambda = 0.2$. Right: Attention pattern for each head at epochs 50, 200 and 400. There is a distinct pattern after each transition (we checked the transition is indeed sudden), which can happen separately in each head and is structured so as to add embedding vectors and transfer them to the output positions. The attention patterns are averaged over the test dataset.
Figure 3: Attention pattern for each head and layer for a particular run (of the six). Each column represents one of the five tasks (see Sec. \ref{sec:setup}). For the last layer we only plotted the three output positions ($=$). Again we see the staircase patterns for an interaction between the digits ($*$) of each integer. Furthermore, in head 1:0 we see how information from the previous sum gets transferred to the current sum so as to determine whether a carried one is needed or not. It is slightly different for each column in the way one expects. For instance, in the third column, the second position of the outcome gets attention from the sum of digits of the last position of each integer.
Figure 4: PCA for the outputs of the attention and MLP blocks in each layer for the two leading principal axes. First three columns are the first layer and positions $0$, $1$ and $2$ (other positions are not shown as they are similar), the last three columns are the second layer at the positions of the outcome. For rows $0$ and $1$: layer $0$ plots are labelled according to the sum (ignoring any carried one) at that position, layer $1$ plots according to the answer at that position. For rows $2$ and $3$ we labelled according to the tasks discussed in Sec. \ref{sec:setup}. We see the first layer determines whether sum $< 10$ or $\geq 10$ and groups those examples (separating also the $9$ as a special case). The second layer instead groups examples according to whether they need a carried one or not. Notice that position $9$ never needs a carried one and so the examples are grouped in the same way as in the first layer.
Figure 5: MLP evolution. Left: Pearson correlation coefficients of the accuracies (corrected and non-corrected) of the ablated model with the accuracies expected when no carrying of the one can be performed. Middle: Accuracy for carrying of the one at position 7 (i.e. set $\texttt{C@1}$). Note the corrected accuracy is obtained by adding a one at position 7 to see if it 'forgot' to add a one. Right: Test/train loss. The dashed vertical lines indicate the kink discussed in the main text.
...and 21 more figures
