Algorithmic Phase Transitions in Language Models: A Mechanistic Case Study of Arithmetic

Alan Sun, Ethan Sun, Warren Shepard

TL;DR

This work asks why zero-shot reasoning varies across tasks. It introduces the notions of algorithmic stability and algorithmic phase transitions, then applies a mechanistic interpretability workflow based on activation patching to Gemma-2-2b across all $m,n$-digit two-operand additions ($m,n \in \{1,\dots,8\}$, $|T|=64$). It finds sharp transitions where the model switches subcircuits between symmetric, boundary, and interior task classes, and links this instability to poor generalization in arithmetic reasoning. The study provides a concrete method for identifying the minimal subcircuits that drive task solutions and shows how task perturbations induce distinct computational mechanisms. These insights offer diagnostic tools for assessing and improving extrapolative and logical reasoning capabilities in transformer-based models.
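To make the task family concrete, the following is a minimal sketch of how the 64 subtasks and the exact-string-match scoring could be set up. The prompt format `"a+b="`, the exclusion of leading zeros, and all function names are illustrative assumptions, not details taken from the paper.

```python
import random

MAX_DIGITS = 8  # m and n each range over 1..8, as stated in the text

# One subtask per (m, n) pair: m-digit first operand, n-digit second operand.
TASKS = [(m, n) for m in range(1, MAX_DIGITS + 1) for n in range(1, MAX_DIGITS + 1)]


def sample_operand(num_digits, rng):
    """Draw a random integer with exactly num_digits digits.

    Assumption: operands have no leading zero, so a 1-digit operand is 1..9.
    """
    low = 10 ** (num_digits - 1)
    high = 10 ** num_digits - 1
    return rng.randint(low, high)


def make_problem(m, n, rng):
    """Return (prompt, target) for one m,n-digit addition problem."""
    a = sample_operand(m, rng)
    b = sample_operand(n, rng)
    prompt = f"{a}+{b}="  # hypothetical prompt format
    return prompt, str(a + b)


def exact_match(prediction, target):
    """Score a completion by exact string match after trimming whitespace."""
    return prediction.strip() == target


if __name__ == "__main__":
    rng = random.Random(0)
    print(len(TASKS))  # number of subtasks
    print(make_problem(4, 8, rng))
```

Per-subtask accuracy would then be the mean of `exact_match` over many sampled problems for each `(m, n)` pair, with the model's completion supplied as `prediction`.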

Abstract

Zero-shot capabilities of large language models make them powerful tools for solving a range of tasks without explicit training. It remains unclear, however, how these models achieve such performance, or why they can zero-shot some tasks but not others. In this paper, we shed light on this phenomenon by defining and investigating algorithmic stability in language models, which concerns changes in the problem-solving strategy employed by the model as a result of changes in task specification. We focus on a task where algorithmic stability is needed for generalization: two-operand arithmetic. Surprisingly, we find that Gemma-2-2b employs substantially different computational models on closely related subtasks, e.g., four-digit versus eight-digit addition. Our findings suggest that algorithmic instability may be a contributing factor to language models' poor zero-shot performance across certain logical reasoning tasks, as they struggle to abstract different problem-solving strategies and smoothly transition between them.

Paper Structure

This paper contains 10 sections, 3 equations, 5 figures.

Figures (5)

  • Figure 1: (a) Accuracy of Gemma-2-2b on two-operand addition as the number of digits in each operand varies. Accuracy is measured by exact string match. (b) t-SNE plot (perplexity = 3) in which each point is the algorithm Gemma-2-2b implements for an $m,n$-digit addition problem. We capture algorithmic similarity by measuring the importance of each attention head. (c) Sample circuits for $8,1$-, $6,3$-, and $4,4$-digit addition, from left to right. These tasks are representative of the phases we identify: symmetric, interior, and boundary.
  • Figure 2: Pairwise Pearson correlations between the attention-head contributions found through activation patching, for a fixed model (Gemma-2-2b). Each cell captures the subcircuit similarity between the pair of subtasks given by the $x$- and $y$-axis labels.
  • Figure 3: For each subtask, the heatmap shows the probability that a given attention head is among the top 5% of the most influential attention heads. We split the boundary subtask into two cases: $|\text{Opr}_1| > |\text{Opr}_2|$ and vice versa.
  • Figure 4: Circuits found for subtasks on the boundary. The attention heads shown are the top 10% of the most influential heads with respect to our patching metric. Since we do not patch any of the MLP layers, some of them are simply omitted from the graphs for brevity.
  • Figure 5: Circuits for the subtasks considered symmetric. The attention heads shown are the top 10% of the most influential heads with respect to our patching metric. For brevity, when no other influential attention heads lie between two attention heads from different layers, we omit the MLPs between them from the graph. We do not, however, patch or remove the MLPs.
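
The captions above rely on two operations over per-head importance scores: a pairwise Pearson correlation between subtasks (Figure 2) and selection of the top fraction of heads (Figures 3 to 5). The sketch below illustrates both under stated assumptions: each subtask is represented by a flat list of per-head scores, the scores themselves are made-up placeholders rather than real activation-patching results, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length score vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(importance):
    """Map each ordered pair of subtasks to the correlation of their head scores."""
    tasks = sorted(importance)
    return {(s, t): pearson(importance[s], importance[t]) for s in tasks for t in tasks}


def top_fraction(scores, fraction):
    """Return the indices of the highest-scoring heads (at least one head)."""
    k = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


if __name__ == "__main__":
    # Placeholder scores for three subtasks over six heads (not real data).
    importance = {
        (4, 4): [0.9, 0.1, 0.0, 0.2, 0.8, 0.1],
        (6, 3): [0.2, 0.7, 0.1, 0.6, 0.1, 0.0],
        (8, 1): [0.1, 0.0, 0.9, 0.1, 0.2, 0.7],
    }
    matrix = correlation_matrix(importance)
    for pair, value in matrix.items():
        print(pair, round(value, 3))
    print(top_fraction(importance[(4, 4)], 0.10))
```

Low or negative correlations between two subtasks would indicate that different heads matter for each, which is the kind of signal the pairwise comparison is meant to expose. How ties and signed scores are handled here is a design choice of this sketch, not something the captions specify.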

Theorems & Definitions (3)

  • Definition 1: Algorithmic Stability
  • Definition 2: Algorithmic Phase Transition
  • Definition 3: Task Partitions