Uncovering Intermediate Variables in Transformers using Circuit Probing

Michael A. Lepori, Thomas Serre, Ellie Pavlick

TL;DR

This paper introduces circuit probing, a method that learns binary masks over model weights to isolate the subcircuits that compute hypothesized intermediate variables in Transformer models. It combines probing with causal analysis and is validated on toy arithmetic tasks and on GPT2-Small and GPT2-Medium for syntactic dependencies such as subject-verb agreement and reflexive anaphora. Across four experiments, it reveals modular, reusable circuits and shows that the relevant circuit forms gradually during training, before the model generalizes. It also identifies causally implicated variables more reliably than baselines such as linear and nonlinear probing, amnesic probing, and counterfactual embeddings. The approach offers a unified framework for mechanistic interpretability, with potential implications for bias, safety, and model debugging.

Abstract

Neural network models have achieved high performance on a wide variety of complex tasks, but the algorithms that they implement are notoriously difficult to interpret. It is often necessary to hypothesize intermediate variables involved in a network's computation in order to understand these algorithms. For example, does a language model depend on particular syntactic properties when generating a sentence? Yet, existing analysis tools make it difficult to test hypotheses of this type. We propose a new analysis technique - circuit probing - that automatically uncovers low-level circuits that compute hypothesized intermediate variables. This enables causal analysis through targeted ablation at the level of model parameters. We apply this method to models trained on simple arithmetic tasks, demonstrating its effectiveness at (1) deciphering the algorithms that models have learned, (2) revealing modular structure within a model, and (3) tracking the development of circuits over training. Across these three experiments we demonstrate that circuit probing combines and extends the capabilities of existing methods, providing one unified approach for a variety of analyses. Finally, we demonstrate circuit probing on a real-world use case: uncovering circuits that are responsible for subject-verb agreement and reflexive anaphora in GPT2-Small and Medium.
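To make "targeted ablation at the level of model parameters" concrete, here is a minimal sketch that assumes a binary mask has already been found. The toy layer, its weights, the mask, and the helper names (`apply_layer`, `ablate_circuit`, `isolate_circuit`) are all hypothetical illustrations and do not come from the paper.

```python
# Hypothetical toy layer (not from the paper): y = W x with two outputs.
WEIGHTS = [[1.0, 0.0, 2.0],
           [0.0, 3.0, 0.0]]

# Assumed circuit mask: 1 marks a weight that belongs to the circuit.
# Here the circuit is taken to be the weights that produce output 0.
CIRCUIT_MASK = [[1, 0, 1],
                [0, 0, 0]]


def apply_layer(weights, x):
    """Plain matrix-vector product, one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def ablate_circuit(weights, mask):
    """Zero the weights inside the circuit and keep everything else."""
    return [[w * (1 - m) for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]


def isolate_circuit(weights, mask):
    """Keep only the weights inside the circuit and zero everything else."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]


x = [1.0, 1.0, 1.0]
full = apply_layer(WEIGHTS, x)
ablated = apply_layer(ablate_circuit(WEIGHTS, CIRCUIT_MASK), x)
isolated = apply_layer(isolate_circuit(WEIGHTS, CIRCUIT_MASK), x)
```

In this toy case, ablating the circuit removes output 0 and leaves output 1 unchanged, which is the kind of selective effect a causal analysis would look for.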

Paper Structure

This paper contains 48 sections, 4 equations, 35 figures, 8 tables, 2 algorithms.

Figures (35)

  • Figure 1: Schematic visualization of circuit probing for an intermediate variable representing the syntactic number of the subject of a sentence. Plural subjects are represented in red and singular subjects in blue. At step $T_0$, prior to training a binary mask, the model component (attention block or MLP) produces residual stream updates (red and blue arrows; Elhage et al., 2021) that are not partitioned by syntactic number (i.e., they point in seemingly random directions). Circuit probing optimizes a binary mask over model weights. By the end of mask optimization, the circuit produces updates that are partitioned by syntactic number (i.e., they point in one direction for singular subjects and another for plural subjects).
  • Figure 2: (a) We see generalization long after overfitting. (b) Probing for $a^2$. Linear and nonlinear probing converge to perfect accuracy very early in training, while circuit probing reveals that the circuit for $a^2$ is formed gradually through training. (c) Probing results for $b^2$, which is not causally implicated in the task $a^2 + b$. Circuit probing reveals that this variable is not represented at any point during training, whereas other methods imply that it is represented from the start of training. (d) Amnesic Probing incorrectly implies that (1) $a^2$ is causally implicated from the start, and (2) $b^2$ is causally implicated throughout training.
  • Figure 3: Experiment 4 GPT2-Small ablation results on layer 6's attention block. Across both Subject-Verb Agreement (Left) and Reflexive Anaphora evaluated using the masculine (Middle) and feminine (Right) pronoun, we see that ablating the discovered circuit renders the model worse at distinguishing syntactic number. Ablating randomly sampled subnetworks does not hurt the model's ability to distinguish singular and plural subjects/referents.
  • Figure 4: MLP probe accuracy for Experiment 1. All methods decode $a^2$ and $-1 * b^2$ worse in the MLP than in the attention block. Note that chance accuracy for circuit probing is effectively 50%.
  • Figure 5: (Left) Transfer performance for $a^2$. We see that pretraining on $a^2-b^2$ confers a benefit to the model when finetuning on $a^2$. (Right) Transfer performance for $a+b$. We see that pretraining on $a^2-b^2$ is a detriment when finetuning on $a+b$.
  • ...and 30 more figures
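
The idea in the Figure 1 caption, searching for a binary mask over weights so that a component's outputs become partitioned by a hypothesized variable, can be sketched on a toy layer. This is an illustration under stated assumptions, not the paper's procedure: the text above does not specify the optimization method or the loss, so the sketch substitutes an exhaustive search over all masks and a leave-one-out nearest-neighbor score. The weights, inputs, labels, and function names are hypothetical.

```python
from itertools import product

# Hypothetical toy component: a 2x3 weight matrix producing a 2-d update.
# Only WEIGHTS[0][0] reads the feature that encodes the hypothesized
# variable; the other nonzero weights read distractor features.
WEIGHTS = [[1.0, 5.0, 0.0],
           [0.0, 0.0, 5.0]]

# Made-up inputs: (number_feature, distractor_1, distractor_2).
# Labels mark the hypothesized variable (1 = plural, 0 = singular).
INPUTS = [(1.0, 0.0, 0.0), (1.0, 1.0, 1.0), (-1.0, 0.0, 0.0), (-1.0, 1.0, 1.0)]
LABELS = [1, 1, 0, 0]


def masked_outputs(mask):
    """Apply the layer with each weight multiplied by its 0/1 mask entry."""
    outs = []
    for x in INPUTS:
        out = []
        for r, row in enumerate(WEIGHTS):
            out.append(sum(w * mask[r * len(row) + c] * x[c]
                           for c, w in enumerate(row)))
        outs.append(tuple(out))
    return outs


def partition_score(outputs):
    """Leave-one-out 1-nearest-neighbor accuracy of the labels."""
    correct = 0
    for i, out in enumerate(outputs):
        others = [j for j in range(len(outputs)) if j != i]
        nearest = min(others, key=lambda j: sum((a - b) ** 2
                                                for a, b in zip(out, outputs[j])))
        correct += LABELS[nearest] == LABELS[i]
    return correct / len(outputs)


def search_mask():
    """Exhaustive stand-in for mask optimization; ties go to sparser masks."""
    n_weights = sum(len(row) for row in WEIGHTS)
    return max(product([0, 1], repeat=n_weights),
               key=lambda m: (partition_score(masked_outputs(m)), -sum(m)))


full_score = partition_score(masked_outputs((1,) * 6))
best_mask = search_mask()
best_score = partition_score(masked_outputs(best_mask))
```

In this toy setup the unmasked outputs are dominated by the distractors, while the selected mask keeps only the single weight that reads the number feature, so the masked outputs split cleanly by label. Exhaustive search is only feasible here because there are six weights; it stands in for whatever optimization a real model would require.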