The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks

Ziqian Zhong, Ziming Liu, Max Tegmark, Jacob Andreas

TL;DR

This work investigates which algorithms neural networks learn when trained on modular addition, showing that even this simple task admits several solution strategies beyond the familiar Clock algorithm. By analyzing architectures with and without attention on the task $a+b\equiv c\pmod{p}$ with $p=59$, the authors identify the Clock and Pizza algorithms as the main solutions, along with evidence for hybrids and non-circular strategies in the networks’ internal representations. They introduce two quantitative metrics, gradient symmetricity and distance irrelevance, to characterize algorithmic phases, and demonstrate sharp phase transitions controlled by attention strength and width, as well as ensemble-like behavior via accompanying pizzas. The findings challenge the notion of a unique algorithmic solution and motivate systematic tools for mapping algorithmic phase spaces. They also suggest both opportunities and caveats for using mechanistic probes in real-world systems.

Abstract

Do neural networks, trained on well-understood algorithmic tasks, reliably rediscover known algorithms for solving those tasks? Several recent studies, on tasks ranging from group arithmetic to in-context linear regression, have suggested that the answer is yes. Using modular addition as a prototypical problem, we show that algorithm discovery in neural networks is sometimes more complex. Small changes to model hyperparameters and initializations can induce the discovery of qualitatively different algorithms from a fixed training set, and even parallel implementations of multiple such algorithms. Some networks trained to perform modular addition implement a familiar Clock algorithm; others implement a previously undescribed, less intuitive, but comprehensible procedure which we term the Pizza algorithm, or a variety of even more complex procedures. Our results show that even simple learning problems can admit a surprising diversity of solutions, motivating the development of new tools for characterizing the behavior of neural networks across their algorithmic phase space.
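To make the contrast between the two named algorithms concrete, the following is a minimal, idealized sketch, not the trained networks themselves. It assumes each token is embedded on a single circle with a hypothetical frequency `k = 1`, that Clock combines the two embeddings multiplicatively (angle addition), and that Pizza works from the midpoint of the two embeddings. The function names and the closed-form normalization in `pizza_logits` are illustrative assumptions; the real networks would have to approximate these steps with attention and ReLU layers.

```python
import numpy as np

p = 59                      # modulus used in the paper's experiments
k = 1                       # hypothetical single frequency (assumption)
w = 2 * np.pi * k / p

def embed(x):
    """Place token x on the unit circle at angle w * x."""
    return np.array([np.cos(w * x), np.sin(w * x)])

def readout(vec):
    """Logit for each candidate c: dot product with c's circle embedding."""
    c = np.arange(p)
    return vec[0] * np.cos(w * c) + vec[1] * np.sin(w * c)

def clock_logits(a, b):
    """Clock-style: multiply embeddings to add angles, giving cos(w(a+b-c))."""
    ea, eb = embed(a), embed(b)
    cos_sum = ea[0] * eb[0] - ea[1] * eb[1]   # cos(w(a+b))
    sin_sum = ea[1] * eb[0] + ea[0] * eb[1]   # sin(w(a+b))
    return readout(np.array([cos_sum, sin_sum]))

def pizza_logits(a, b):
    """Pizza-style: start from the midpoint of the two embeddings.

    The midpoint has length |cos(w(a-b)/2)| and angle w(a+b)/2 (up to sign).
    Doubling its angle and dividing by its length gives logits of the form
    |cos(w(a-b)/2)| * cos(w(a+b-c)). The length is never zero for odd p.
    """
    m = (embed(a) + embed(b)) / 2
    norm = np.linalg.norm(m)
    doubled = np.array([m[0] ** 2 - m[1] ** 2, 2 * m[0] * m[1]]) / norm
    return readout(doubled)

a, b = 3, 10
print(int(np.argmax(clock_logits(a, b))), int(np.argmax(pizza_logits(a, b))))
```

In this sketch both procedures pick out $(a+b) \bmod p$, but the size of the Pizza-style correct logit shrinks with the distance between $a$ and $b$, while the Clock-style correct logit does not.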

Paper Structure

This paper contains 53 sections, 1 theorem, 10 equations, 26 figures, and 3 tables.

Key Result

Lemma A.1

A symmetric function $f(x,y)$ that is a linear combination of $\cos x,\sin x,\cos y,\sin y$ ...

Footnote: The actual neural networks could be more complicated: even if a neural network is symmetric and locally linear, it could be locally asymmetric (e.g. $|x|+|y|$ could locally be $x-y$). Nevertheless, the pa...

Figures (26)

  • Figure 1: Illustration of the Clock and the Pizza Algorithm.
  • Figure 2: Gradients on first six principal components of input embeddings. $(a,b,c)$ in the title stands for taking gradients on the output logit $c$ for input $(a,b)$. x and y axes represent the gradients for embeddings of the first and the second token. The dashed line $y=x$ signals a symmetric gradient.
  • Figure 3: Correct Logits of Model A & Model B. The correct logits of Model A (left) have a clear dependence on $a-b$, while those of Model B (right) do not.
  • Figure 4: Correct logits of Model A (Pizza) after circle isolation. The rightmost pizza is accompanying the third pizza (discussed in Section \ref{sec:accompany} and Appendix \ref{sec:pizzapairs}). Top: The logit pattern depends on $a-b$. Bottom: Embeddings for each circle.
  • Figure 5: Correct logits of Model B (Clock) after circle isolation. Top: The logit pattern depends on $a+b$. Bottom: Embeddings for each circle.
  • ...and 21 more figures

Theorems & Definitions (5)

  • Definition 4.1: Gradient Symmetricity
  • Definition 4.2: Distance Irrelevance
  • Lemma A.1
  • proof
  • Definition B.1: Circularity
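
The Figure 3 caption contrasts correct logits that depend on $a-b$ with ones that do not. One simple way to quantify that contrast is sketched below. This is a hypothetical proxy in the spirit of the distance irrelevance named in Definition 4.2, whose exact form is not reproduced on this page and may differ. The function name `within_distance_std_ratio` and both synthetic logit tables are assumptions made for illustration; they are not measurements from Model A or Model B.

```python
import numpy as np

def within_distance_std_ratio(correct):
    """Ratio of within-distance spread to overall spread of correct logits.

    correct[a, b] is the logit assigned to the right answer (a + b) mod p.
    Inputs are grouped by d = (a - b) mod p. A ratio near 0 means the logit
    is determined by d alone; a ratio near 1 means d carries no information.
    """
    p = correct.shape[0]
    a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    d = (a - b) % p
    within = np.mean([correct[d == dist].std() for dist in range(p)])
    return within / correct.std()

p = 59
w = 2 * np.pi / p
a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")

# Synthetic table whose correct logit depends only on a - b (assumption).
distance_only = np.abs(np.cos(w * (a - b) / 2))
# Synthetic table whose correct logit depends only on a + b (assumption).
sum_only = 1.0 + 0.1 * np.cos(w * (a + b))

print(within_distance_std_ratio(distance_only))
print(within_distance_std_ratio(sum_only))
```

On these synthetic tables the first ratio is essentially zero and the second is essentially one, which matches the qualitative split described for the two models in Figure 3.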