Post-Routing Arithmetic in Llama-3: Last-Token Result Writing and Rotation-Structured Digit Directions

Yao Yan

Abstract

We study three-digit addition in Meta-Llama-3-8B (base) under a one-token readout to characterize how arithmetic answers are finalized after cross-token routing becomes causally irrelevant. Causal residual patching and cumulative attention ablations localize a sharp boundary near layer 17: beyond it, the decoded sum is controlled almost entirely by the last input token and late-layer self-attention is largely dispensable. In this post-routing regime, digit(-sum) direction dictionaries vary with a next-higher-digit context but are well-related by an approximately orthogonal map inside a shared low-rank subspace (low-rank Procrustes alignment). Causal digit editing matches this geometry: naive cross-context transfer fails, while rotating directions through the learned map restores strict counterfactual edits; negative controls do not recover.
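As an illustration of what a low-rank orthogonal Procrustes alignment between two context-conditioned dictionaries can look like, the following is a minimal sketch on synthetic data. It is not the paper's procedure: the function names (`lowrank_procrustes`, `transport`, `mean_cosine`, `rel_fro`), the choice of shared subspace (top right-singular vectors of the stacked dictionaries), the dimensions, and the relative Frobenius error used as a stand-in for $\mathrm{RelFro}$ are all assumptions made for the example.

```python
import numpy as np


def lowrank_procrustes(D1, D2, rank):
    """Align dictionary D1 to D2 inside a shared rank-`rank` subspace.

    D1, D2: arrays of shape (n_buckets, d); row i is the direction for bucket i.
    Returns (B, R): B is a (d, rank) orthonormal basis of the shared subspace
    (assumption: taken from an SVD of the stacked dictionaries), and R is the
    (rank, rank) orthogonal matrix minimizing ||D1 B R - D2 B||_F.
    """
    _, _, Vt = np.linalg.svd(np.vstack([D1, D2]), full_matrices=False)
    B = Vt[:rank].T
    X, Y = D1 @ B, D2 @ B
    # Classical orthogonal Procrustes solution: R = U W^T for X^T Y = U S W^T.
    U, _, Wt = np.linalg.svd(X.T @ Y)
    return B, U @ Wt


def transport(V, B, R):
    """Map row vectors from context 1 to context 2 through the learned rotation."""
    return (V @ B) @ R @ B.T


def mean_cosine(A, C):
    """Mean row-wise cosine similarity between two dictionaries."""
    num = np.sum(A * C, axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(C, axis=1)
    return float(np.mean(num / den))


def rel_fro(A, C):
    """Relative Frobenius error ||A - C||_F / ||C||_F (hypothetical RelFro stand-in)."""
    return float(np.linalg.norm(A - C) / np.linalg.norm(C))


# Synthetic demo: two dictionaries that share a rank-6 core but differ by a rotation.
rng = np.random.default_rng(0)
d, r, n_buckets = 64, 6, 10
Q, _ = np.linalg.qr(rng.standard_normal((d, r)))       # shared subspace basis
R_true, _ = np.linalg.qr(rng.standard_normal((r, r)))  # hidden context rotation
coeffs = rng.standard_normal((n_buckets, r))
D1 = coeffs @ Q.T
D2 = coeffs @ R_true @ Q.T

B, R = lowrank_procrustes(D1, D2, rank=r)
unaligned_cos = mean_cosine(D1, D2)
aligned_cos = mean_cosine(transport(D1, B, R), D2)
aligned_err = rel_fro(transport(D1, B, R), D2)
print(unaligned_cos, aligned_cos, aligned_err)
```

In this toy setting the unaligned directions disagree across contexts, while the transported directions match the target dictionary up to numerical error. That is the pattern the abstract describes as naive transfer failing and rotated transfer succeeding; real activations would be expected to align only approximately.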

Paper Structure (68 sections, 22 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Layerwise context-conditioned dictionary isomorphism (two digit-shifted settings). For each layer $L$, metrics are averaged over off-diagonal context pairs $(c_1,c_2)$. (a) Unaligned cosine decreases with depth, indicating non-invariance of individual bucket directions. (b) After low-rank orthogonal Procrustes alignment (Sec. \ref{app:alignment_details}), aligned cosine remains near unity while $\mathrm{RelFro}$ stays small, consistent with a rotation-like re-parameterization of a shared low-rank core. Solid: stripped-tens context $\rightarrow$ ones-sum; dashed: result-hundreds-digit context $\rightarrow$ result-tens digit.
  • Figure 2: Single-layer anchor: context-conditioned rotation at a fixed depth. At a representative post-routing layer, unaligned similarity between context dictionaries varies substantially (top), while low-rank Procrustes alignment collapses similarity to near unity (bottom).
  • Figure 3: Rotation is necessary for cross-context digit editing. Strict counterfactual success vs. layer (baseline-correct subset). The rotated condition recovers toward direct, while transfer collapses; the wrong-condition and random-$R$ controls fail to recover.
  • Figure 4: Edit-magnitude sensitivity across two settings. Strict success vs. $|\Delta|$ (layers 24--26, weighted by $n_{\mathrm{total}}$). Left: $H\!\rightarrow\!t$. Right: $T\!\rightarrow\!o$. transfer degrades with $|\Delta|$, while rotated remains close to direct; random-$R$ and wrong-condition controls do not recover.
  • Figure 5: Cross-prompt transport of digit-editing interventions. (a) Strict edit success vs. layer under the canonical template and three paraphrases. (b) Bar plot of answer accuracy under each prompt.
  • ...and 1 more figure