Uncovering Intermediate Variables in Transformers using Circuit Probing
Michael A. Lepori, Thomas Serre, Ellie Pavlick
TL;DR
This paper introduces circuit probing, a method that learns binary masks to isolate the subcircuits that compute hypothesized intermediate variables in Transformer models. It combines probing with causal analysis, and the authors demonstrate its fidelity on toy arithmetic tasks and on GPT2-Small and GPT2-Medium for syntactic dependencies such as subject-verb agreement and reflexive anaphora. Across four experiments, it reveals modular, reusable circuits and shows that circuits form during training before the model generalizes. It outperforms several baselines, including linear and nonlinear probing, amnesic probing, and counterfactual embeddings, at identifying causally implicated variables. The approach offers a unified framework for mechanistic interpretability, with potential implications for bias, safety, and model debugging.
Abstract
Neural network models have achieved high performance on a wide variety of complex tasks, but the algorithms that they implement are notoriously difficult to interpret. It is often necessary to hypothesize intermediate variables involved in a network's computation in order to understand these algorithms. For example, does a language model depend on particular syntactic properties when generating a sentence? Yet, existing analysis tools make it difficult to test hypotheses of this type. We propose a new analysis technique, circuit probing, that automatically uncovers low-level circuits that compute hypothesized intermediate variables. This enables causal analysis through targeted ablation at the level of model parameters. We apply this method to models trained on simple arithmetic tasks, demonstrating its effectiveness at (1) deciphering the algorithms that models have learned, (2) revealing modular structure within a model, and (3) tracking the development of circuits over training. Across these three experiments, we demonstrate that circuit probing combines and extends the capabilities of existing methods, providing one unified approach for a variety of analyses. Finally, we demonstrate circuit probing on a real-world use case: uncovering circuits that are responsible for subject-verb agreement and reflexive anaphora in GPT2-Small and GPT2-Medium.
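To make "targeted ablation at the level of model parameters" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy linear layer whose first output computes a hypothesized intermediate variable a + b, and a binary mask that is specified by hand. In circuit probing the mask would be learned rather than hand-written, and that learning step is not shown here. The names apply_mask and ablate_circuit, the weights, and the mask values are all hypothetical.

```python
import numpy as np

def apply_mask(weights, mask):
    """Keep only the weights selected by the binary mask (the candidate circuit)."""
    return weights * mask

def ablate_circuit(weights, mask):
    """Zero out the weights selected by the binary mask and keep everything else."""
    return weights * (1 - mask)

# Toy layer (hypothetical): output 0 computes a + b, output 1 copies c.
weights = np.array([[1.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])

# Hand-specified binary mask marking the weights assumed to compute a + b.
# In circuit probing this mask would be learned, which is not done here.
mask = np.array([[1, 1, 0],
                 [0, 0, 0]])

x = np.array([2.0, 3.0, 7.0])  # inputs a=2, b=3, c=7

full_out = weights @ x                         # full layer output
circuit_out = apply_mask(weights, mask) @ x    # only the candidate circuit
ablated_out = ablate_circuit(weights, mask) @ x  # layer with the circuit removed

print(full_out, circuit_out, ablated_out)
```

In this toy setting, ablating the masked weights removes the a + b output while leaving the unrelated output intact. This is the kind of selective effect a causal test of a hypothesized intermediate variable looks for.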
