Algorithmic Phase Transitions in Language Models: A Mechanistic Case Study of Arithmetic
Alan Sun, Ethan Sun, Warren Shepard
TL;DR
This work asks why zero-shot reasoning varies across tasks. It introduces two notions, algorithmic stability and algorithmic phase transitions, and applies a mechanistic interpretability workflow based on activation patching to Gemma-2-2b on all $m,n$-digit two-operand addition tasks ($m,n \in \{1,\dots,8\}$, so $|T|=64$). The analysis finds sharp transitions at which the model switches subcircuits between symmetric, boundary, and interior task classes, and it links this instability to poor generalization in arithmetic reasoning. The study provides a concrete method for identifying the minimal subcircuits that drive task solutions and shows how task perturbations induce distinct computational mechanisms. These insights offer diagnostic tools for assessing and improving extrapolative and logical reasoning in transformer-based models.
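To make the task family concrete, the following is a minimal sketch that enumerates the 64 $(m,n)$ tasks and draws one example problem per task. The prompt format (`"a+b="`), the exclusion of leading zeros, and the helper names `sample_operand` and `make_prompt` are illustrative assumptions, not details taken from the paper.

```python
import random
from itertools import product

MAX_DIGITS = 8  # m and n each range over 1..8, as stated in the text


def sample_operand(num_digits: int, rng: random.Random) -> int:
    """Draw a random integer with exactly `num_digits` digits.

    Assumption: no leading zeros, so a one-digit operand is in 1..9.
    """
    low = 10 ** (num_digits - 1)
    high = 10 ** num_digits - 1
    return rng.randint(low, high)


def make_prompt(m: int, n: int, rng: random.Random) -> tuple[str, int]:
    """Build one (prompt, answer) pair for the (m, n)-digit addition task.

    The "a+b=" template is a hypothetical prompt format.
    """
    a = sample_operand(m, rng)
    b = sample_operand(n, rng)
    return f"{a}+{b}=", a + b


# The task set T: every ordered pair of operand lengths.
tasks = list(product(range(1, MAX_DIGITS + 1), repeat=2))

rng = random.Random(0)  # fixed seed so the examples are reproducible
examples = {(m, n): make_prompt(m, n, rng) for (m, n) in tasks}
```

Each key of `examples` identifies one task in $T$. In practice one would draw many prompts per task rather than one.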
Abstract
Zero-shot capabilities of large language models make them powerful tools for solving a range of tasks without explicit training. It remains unclear, however, how these models achieve such performance, or why they can zero-shot some tasks but not others. In this paper, we shed some light on this phenomenon by defining and investigating algorithmic stability in language models, that is, the extent to which the problem-solving strategy employed by the model changes as a result of changes in task specification. We focus on a task where algorithmic stability is needed for generalization: two-operand arithmetic. Surprisingly, we find that Gemma-2-2b employs substantially different computational models on closely related subtasks, e.g., four-digit versus eight-digit addition. Our findings suggest that algorithmic instability may be a contributing factor to language models' poor zero-shot performance across certain logical reasoning tasks, as they struggle to abstract different problem-solving strategies and smoothly transition between them.
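Activation patching, the technique named in the TL;DR, can be illustrated on a toy network. The sketch below is not the paper's pipeline and does not touch Gemma-2-2b: the two-layer network, its random weights, and the `recovery` metric are all assumptions chosen for illustration. It caches hidden activations from a "clean" input, overwrites one hidden unit at a time in a "corrupted" run, and measures how much of the clean output each patch restores.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 6))  # input -> hidden weights (toy, random)
W2 = rng.normal(size=(6,))    # hidden -> scalar output weights (toy, random)


def forward(x, patch=None):
    """Run the toy network.

    `patch` optionally maps a hidden-unit index to the activation value
    that should overwrite that unit before the readout.
    """
    hidden = np.tanh(x @ W1)
    if patch:
        hidden = hidden.copy()
        for idx, value in patch.items():
            hidden[idx] = value
    return float(hidden @ W2), hidden


x_clean = rng.normal(size=4)    # stands in for a prompt the model solves
x_corrupt = rng.normal(size=4)  # stands in for a perturbed prompt

y_clean, h_clean = forward(x_clean)
y_corrupt, _ = forward(x_corrupt)


def recovery(patched_output):
    """Fraction of the clean-vs-corrupted output gap closed by a patch."""
    return (patched_output - y_corrupt) / (y_clean - y_corrupt)


# Patch one hidden unit at a time with its clean activation.
effects = {}
for i in range(len(h_clean)):
    y_patched, _ = forward(x_corrupt, patch={i: h_clean[i]})
    effects[i] = recovery(y_patched)

# Sanity check: patching every unit at once reproduces the clean output.
y_all, _ = forward(x_corrupt, patch=dict(enumerate(h_clean)))
```

Units with a large `effects` value are the ones the output depends on most for this input pair. In a transformer the patched sites would typically be attention-head or MLP outputs rather than individual hidden units, and because this toy readout is linear the per-unit effects sum to one, which does not hold in general.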
