Tracking Equivalent Mechanistic Interpretations Across Neural Networks

Alan Sun, Mariya Toneva

Abstract

Mechanistic interpretability (MI) is an emerging framework for interpreting neural networks. Given a task and model, MI aims to discover a succinct algorithmic process, an interpretation, that explains the model's decision process on that task. However, MI is difficult to scale and generalize. This stems in part from two key challenges: there is no precise notion of a valid interpretation, and generating interpretations is often an ad hoc process. In this paper, we address these challenges by defining and studying the problem of interpretive equivalence: determining whether two different models share a common interpretation, without requiring an explicit description of what that interpretation is. At the core of our approach, we propose and formalize the principle that two interpretations of a model are equivalent if all of their possible implementations are also equivalent. We develop an algorithm to estimate interpretive equivalence and demonstrate its use in a case study on Transformer-based models. To analyze our algorithm, we introduce necessary and sufficient conditions for interpretive equivalence based on models' representation similarity. We provide guarantees that simultaneously relate a model's algorithmic interpretations, circuits, and representations. Our framework lays a foundation for the development of more rigorous evaluation methods of MI and automated, generalizable interpretation discovery methods.
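
One compact way to read this principle, in notation introduced here only for illustration (the paper's Definition 5.1 on implementations and the surrounding definitions give the precise formulation): write $\mathrm{Impl}(A)$ for the set of models that implement an interpretation $A$, and $\sim$ for an equivalence between models. The principle then states

$$\forall\, h \in \mathrm{Impl}(A_1),\ \forall\, h' \in \mathrm{Impl}(A_2):\ h \sim h' \quad\Longrightarrow\quad A_1 \equiv A_2,$$

and interpretive equivalence of two given models $h_1, h_2$ asks whether their (possibly unknown) interpretations $A_1, A_2$ stand in the relation $A_1 \equiv A_2$.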

Paper Structure

This paper contains 30 sections, 12 theorems, 54 equations, and 5 figures.

Key Result

Theorem E.1

For $i\in\{1,2\}$, let $h_i$ be a language model and let $S \subset \Sigma^\star$ be such that $h_1(S) = h_2(S)$. Suppose that each $h_i$ admits a circuit $K_i \triangleq ({\mathbb{V}}_i, U, {\mathbb{F}}_i, \succ)$. Then,

Figures (5)

  • Figure 1: A high-level overview of our algorithmic approach to interpretive equivalence. Consider models $h_{1}, h_{2}$ that correspond to possibly unknown interpretations $A_1, A_2$ (Left). To determine whether models $h_1$ and $h_2$ are interpretively equivalent, we propose a two-step procedure. First, we sample another model $h^\star$ that also has interpretation $A_1$ (Center). Second, we compare the representation similarity ($d_{\mathrm{repr}}$) between $h_1, h^\star$ and between $h^\star, h_{2}$ (Right). If $A_1, A_2$ are equivalent, then, averaged over all implementations $h^\star$, we should not be able to differentiate $d_{\mathrm{repr}}(h_{1}, h^\star)$ and $d_{\mathrm{repr}}(h^\star, h_{2})$. A minimal code sketch of this procedure appears after the figure list.
  • Figure 2: (Left) Average congruity between models associated with different interpretations. One cell color indicates that the models have different interpretations; the other indicates that the models' interpretations are statistically indistinguishable. (Center) Congruity between the GPT2 and Pythia families of models on the IOI task. Models are grouped according to the actual interpretive differences observed by tigges_llm_2024 and merullo2024circuit. (Right) Congruity between GPT2 on next-token prediction (for different token types: all tokens, articles, prepositions, punctuation, parentheses, and terminal punctuation) vs. GPT2 on in-context parts-of-speech identification.
  • Figure 3:
  • Figure 4: Pipeline for generating implementations of our interpretations (i.e., RASP language models); a code sketch of the compilation step appears after the figure list. We first construct RASP programs and then, using the procedure introduced in lindner_tracr_2023, compile these programs into Transformer models. These Transformers use exclusively hard attention, and their architecture is minimal (containing only the components necessary to fully implement the given RASP program). We then apply the procedure introduced in gupta2024interpbench to translate these Tracr-compiled Transformers into "real" Transformers: ones whose weight distribution matches those trained with stochastic gradient descent; these translated models also contain more
  • Figure 5: Activation patching results for each attention head. The coloring indicates the percentage of POS-task performance that each individual attention head contributes.
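
A minimal Python sketch of the two-step procedure from Figure 1. Everything below is an assumption made for illustration: the paper's actual representation-similarity measure $d_{\mathrm{repr}}$, its procedure for sampling implementations $h^\star$, and its statistical test may all differ. Here we stand in linear CKA for $d_{\mathrm{repr}}$, assume a hypothetical helper sample_implementation() that returns the activations of a freshly sampled model with interpretation $A_1$, and use a Wilcoxon signed-rank test to compare the two similarity samples.

import numpy as np
from scipy.stats import wilcoxon

def linear_cka(X, Y):
    """Linear CKA between activation matrices X, Y of shape (n_inputs, dim).

    Used here as a stand-in for the representation-similarity measure d_repr.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

def congruity_test(acts_h1, acts_h2, sample_implementation, n_samples=50, alpha=0.05):
    """Estimate whether models h1 and h2 are interpretively equivalent.

    acts_h1, acts_h2: activations of the two models on a shared input set.
    sample_implementation: hypothetical helper that returns activations of a
        freshly sampled model h* implementing interpretation A_1.
    Returns (fail_to_reject_equivalence, p_value).
    """
    d_left, d_right = [], []
    for _ in range(n_samples):
        acts_star = sample_implementation()              # Step 1: sample h*.
        d_left.append(linear_cka(acts_h1, acts_star))    # d_repr(h1, h*)
        d_right.append(linear_cka(acts_star, acts_h2))   # d_repr(h*, h2)
    # Step 2: if the interpretations A_1, A_2 are equivalent, the two
    # similarity samples should be statistically indistinguishable.
    _, p_value = wilcoxon(d_left, d_right)
    return p_value >= alpha, p_value

The compilation step of the Figure 4 pipeline can likewise be sketched using Tracr's documented compilation interface (google-deepmind/tracr); exact argument names may vary across versions. The RASP program below (a toy token-frequency counter), the vocabulary, and the sequence length are placeholders rather than the programs used in the paper, and the InterpBench translation step of gupta2024interpbench is only indicated by a comment.

from tracr.rasp import rasp
from tracr.compiler import compiling

# Toy RASP program: at each position, count how many tokens in the
# sequence equal the current token.
def make_frequencies():
    same_token = rasp.Select(rasp.tokens, rasp.tokens, rasp.Comparison.EQ)
    return rasp.SelectorWidth(same_token)

# Compile the RASP program into a minimal, hard-attention Transformer.
compiled = compiling.compile_rasp_to_model(
    make_frequencies(),
    vocab={"a", "b", "c"},   # placeholder vocabulary
    max_seq_len=8,           # placeholder sequence length
    compiler_bos="BOS",
)
print(compiled.apply(["BOS", "a", "b", "a"]).decoded)

# The Tracr-compiled weights are then translated into a "real" Transformer
# (one whose weight distribution matches SGD-trained models) following
# gupta2024interpbench; that step is not reproduced here.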

Theorems & Definitions (55)

  • Example 1.1: Reduction to Simpler Models
  • Example 1.2: Reduction to Simpler Tasks
  • Definition 4.1: Deterministic Causal Model, geiger2023causal
  • Definition 4.2: Circuit
  • Definition 4.3: Representations
  • Definition 4.4
  • Definition 5.1: Implementation
  • Definition 5.2
  • Definition 5.3
  • Definition 6.1: Representation Similarity
  • ...and 45 more