Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition

Qinyuan Ye, Robin Jia, Xiang Ren

TL;DR

This work identifies a mechanism that explains the model's generalization from standard addition to off-by-one addition, and shows that the induction of the +1 function is governed by multiple attention heads in parallel, each of which emits a distinct piece of the +1 function.
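Viewed as a program, off-by-one addition is standard addition followed by the +1 function. The minimal Python sketch below is an illustration, not the authors' code; the names `standard_add`, `plus_one`, and `compose` are hypothetical. It composes the two steps and reproduces the pattern 1+1=3, 2+2=5, 3+3=7.

```python
from typing import Callable


def standard_add(a: int, b: int) -> int:
    """Step 1: ordinary addition."""
    return a + b


def plus_one(x: int) -> int:
    """Step 2: the unexpected +1 function, f(x) = x + 1."""
    return x + 1


def compose(f: Callable[[int], int],
            g: Callable[[int, int], int]) -> Callable[[int, int], int]:
    """Return the two-step task (a, b) -> f(g(a, b))."""
    return lambda a, b: f(g(a, b))


# Off-by-one addition = (+1) applied after standard addition.
off_by_one_add = compose(plus_one, standard_add)

for a in (1, 2, 3):
    print(f"{a}+{a}={off_by_one_add(a, a)}")  # prints 1+1=3, 2+2=5, 3+3=7
```

Separating the two steps this way mirrors the framing of the task as "standard addition plus an unexpected second step": the model already knows `standard_add` and must infer `plus_one` from context.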

Abstract

Large language models demonstrate the intriguing ability to perform unseen tasks via in-context learning. However, it remains unclear what mechanisms inside the model drive such task-level generalization. In this work, we approach this question through the lens of off-by-one addition (i.e., 1+1=3, 2+2=5, 3+3=?), a two-step, counterfactual task with an unexpected +1 function as a second step. Leveraging circuit-style interpretability techniques such as path patching, we analyze the models' internal computations behind their performance and present three key findings. First, we identify a mechanism that explains the model's generalization from standard addition to off-by-one addition. It resembles the induction head mechanism described in prior work, yet operates at a higher level of abstraction; we therefore term it "function induction" in this work. Second, we show that the induction of the +1 function is governed by multiple attention heads in parallel, each of which emits a distinct piece of the +1 function. Finally, we find that this function induction mechanism is reused in a broader range of tasks, including synthetic tasks such as shifted multiple-choice QA and algorithmic tasks such as base-8 addition. Overall, our findings offer deeper insights into how reusable and composable structures within language models enable task-level generalization.
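A minimal sketch of how an off-by-one addition prompt can be assembled, assuming the "a+b=c\n" format with 4 in-context examples described in the Figure 2 caption. The helper name `make_off_by_one_prompt` and the single-digit operand range (`max_operand=9`) are assumptions for illustration, not details taken from the paper.

```python
import random


def make_off_by_one_prompt(num_examples: int = 4,
                           max_operand: int = 9,
                           seed: int = 0) -> tuple[str, str]:
    """Build a few-shot off-by-one addition prompt and its expected completion.

    Each in-context example uses the "a+b=c\\n" format with c = a + b + 1.
    The final query stops right after "=", so the model must predict the answer.
    """
    rng = random.Random(seed)  # seeded for reproducible prompts
    lines = []
    for _ in range(num_examples):
        a, b = rng.randint(0, max_operand), rng.randint(0, max_operand)
        lines.append(f"{a}+{b}={a + b + 1}\n")
    qa, qb = rng.randint(0, max_operand), rng.randint(0, max_operand)
    prompt = "".join(lines) + f"{qa}+{qb}="
    return prompt, str(qa + qb + 1)


prompt, answer = make_off_by_one_prompt()
print(prompt + " -> expected: " + answer)
```

A model that generalizes from standard addition should complete the query with `answer` rather than with the ordinary sum `qa + qb`.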

Paper Structure

This paper contains 63 sections, 1 equation, 36 figures, 6 tables.

Figures (36)

  • Figure 1: In-context Learning Performance of Off-by-One Addition.
  • Figure 2: Circuit Discovery with Gemma-2 (9B). Top: Patching Results on Selected Target Nodes. (a) We identify Group 1 heads and Group 2 heads that directly influence the output logits. (b) We identify Group 3 heads that write to the value of H39.7. Bottom: Attention Pattern of Selected Heads. We use 4 ICL examples in the format "a+b=c\n". Causally relevant positions are marked in pink. (c) Group 1 heads mainly attend to the current token and <bos>. (d) Group 2 heads attend to the answer tokens ($c_i$) of previous ICL examples at the position of "=". (e) Group 3 heads attend to the preceding "=" at the position of $c_i$.
  • Figure 3: Overview of the Identified Circuit.
  • Figure 4: Head Ablation Results.
  • Figure 5: Individual and Overall Effect of Identified FI Heads. Each head writes out different information, which aggregates to implement the function of $f(x)=x+1$ (bottom-right panel). (*) Effects of H32.6, H25.13, and H32.4 are rescaled to [-0.15, 0.15] to make the patterns more readable.
  • ...and 31 more figures
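The patching results in Figure 2 rest on a simple idea: replace one component's activation with its value from a different (corrupted) input and measure how the output changes. The toy sketch below shows plain activation patching on a hypothetical three-head "model" whose output metric is the sum of its heads' scalar contributions. It is a simpler relative of the path patching used in the paper, which additionally restricts the patched signal to flow only into chosen receiver nodes (such as the output logits or the value of H39.7). The head names, head functions, and inputs are invented for illustration.

```python
from typing import Callable, Dict, Optional

# Hypothetical toy "model": each head maps an input x to a scalar contribution,
# and the output metric (standing in for, e.g., a logit difference) is their sum.
HEADS: Dict[str, Callable[[float], float]] = {
    "head_A": lambda x: 2.0 * x,
    "head_B": lambda x: 0.5,  # ignores its input entirely
    "head_C": lambda x: -1.0 * x,
}


def run(x: float,
        patch: Optional[Dict[str, float]] = None) -> tuple[float, Dict[str, float]]:
    """Run the toy model, optionally overwriting some head outputs.

    Returns the output metric and the cache of head outputs actually used.
    """
    cache = {name: fn(x) for name, fn in HEADS.items()}
    if patch:
        cache.update(patch)
    return sum(cache.values()), cache


def patching_effect(clean_x: float, corrupt_x: float) -> Dict[str, float]:
    """Activation patching: swap one head's corrupted-run output into the
    clean run and record how much the output metric changes."""
    clean_out, _ = run(clean_x)
    _, corrupt_cache = run(corrupt_x)
    effects = {}
    for name in HEADS:
        patched_out, _ = run(clean_x, patch={name: corrupt_cache[name]})
        effects[name] = patched_out - clean_out
    return effects


print(patching_effect(clean_x=3.0, corrupt_x=1.0))
# → {'head_A': -4.0, 'head_B': 0.0, 'head_C': 2.0}
```

In this linear toy, head_B ignores its input, so patching it has no effect, while head_A and head_C shift the metric in opposite directions. In a real transformer, effects are generally not additive and have to be measured empirically, which is why patching is run head by head (or path by path).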