InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

Rohan Gupta; Iván Arcuschin; Thomas Kwa; Adrià Garriga-Alonso

TL;DR

InterpBench introduces 86 semi-synthetic transformers with known ground-truth circuits for validating mechanistic interpretability (MI) methods. The models are trained with Strict Interchange Intervention Training (SIIT), an extension of Interchange Intervention Training (IIT) that additionally prevents non-aligned low-level components from influencing the output, so the trained models faithfully implement the intended circuits. Realism analyses show that SIIT transformers resemble naturally trained models in weight distributions and behavior, which makes them a meaningful testbed for circuit discovery techniques. On this benchmark, ACDC outperforms several baselines, while EAP-IG remains competitive.

Abstract

Mechanistic interpretability methods aim to identify the algorithm a neural network implements, but it is difficult to validate such methods when the true algorithm is unknown. This work presents InterpBench, a collection of semi-synthetic yet realistic transformers with known circuits for evaluating these techniques. We train simple neural networks using a stricter version of Interchange Intervention Training (IIT) which we call Strict IIT (SIIT). Like the original, SIIT trains neural networks by aligning their internal computation with a desired high-level causal model, but it also prevents non-circuit nodes from affecting the model's output. We evaluate SIIT on sparse transformers produced by the Tracr tool and find that SIIT models maintain Tracr's original circuit while being more realistic. SIIT can also train transformers with larger circuits, like Indirect Object Identification (IOI). Finally, we use our benchmark to evaluate existing circuit discovery techniques.
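
The two properties that the abstract attributes to SIIT can be illustrated with a toy example. The sketch below is not the authors' training code: the models, the node names (`sum`, `n_sum`, `n_extra`), and the helper functions are hypothetical and chosen only for illustration. It shows (1) an interchange intervention on an aligned node, whose result should match the high-level causal model under the same intervention, and (2) a strictness check, where resampling a non-circuit node should leave the output unchanged.

```python
from typing import Callable, Dict, Optional, Tuple

Inputs = Tuple[int, int, int]
Patch = Optional[Dict[str, int]]


def high_level(x: Inputs, patch: Patch = None) -> int:
    """Toy high-level causal model: out = (a + b) * c, with one node 'sum'."""
    patch = patch or {}
    a, b, c = x
    s = patch.get("sum", a + b)
    return s * c


def low_level_nodes(x: Inputs) -> Dict[str, int]:
    """Clean values of the toy low-level nodes (hypothetical names)."""
    a, b, _ = x
    return {"n_sum": a + b, "n_extra": a - b}


def make_low_level(leaky: bool) -> Callable[[Inputs, Patch], int]:
    """Build a toy low-level model; 'n_sum' is aligned with 'sum',
    'n_extra' is a non-circuit node. The leaky variant reads 'n_extra'
    in a way that cancels on clean runs but not under intervention."""
    def model(x: Inputs, patch: Patch = None) -> int:
        patch = patch or {}
        a, b, c = x
        clean = low_level_nodes(x)
        n_sum = patch.get("n_sum", clean["n_sum"])
        n_extra = patch.get("n_extra", clean["n_extra"])
        out = n_sum * c
        if leaky:
            out += n_extra - (a - b)  # zero unless n_extra is patched
        return out
    return model


def interchange(model, node: str, base: Inputs, source: Inputs) -> int:
    """Run `model` on `base`, with `node` set to its value on `source`."""
    return model(base, {node: low_level_nodes(source)[node]})


def aligned_node_consistent(model, base: Inputs, source: Inputs) -> bool:
    """IIT-style check: patching the aligned node matches the high-level model."""
    expected = high_level(base, {"sum": source[0] + source[1]})
    return interchange(model, "n_sum", base, source) == expected


def non_circuit_node_inert(model, base: Inputs, source: Inputs) -> bool:
    """Strictness check: patching the non-circuit node leaves the output unchanged."""
    return interchange(model, "n_extra", base, source) == model(base)


base, source = (1, 2, 3), (4, 6, 5)
strict_model = make_low_level(leaky=False)
leaky_model = make_low_level(leaky=True)
```

Under these assumptions, both toy models agree with the high-level model on clean inputs and pass the aligned-node check, but only `strict_model` passes the strictness check. The leaky model corresponds to the failure case that the stricter objective is meant to rule out.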

Paper Structure (25 sections, 6 equations, 18 figures, 8 tables, 1 algorithm)

Introduction
Contributions
Related work
Linearly compressed Tracr models.
Features in MI.
Other MI benchmarks.
Strict Interchange Intervention Training
InterpBench
Evaluation
Results
RQ1 & RQ2.
RQ3.
RQ4.
Conclusion
Limitations.
...and 10 more sections

Figures (18)

Figure 1: SIIT transformers implement a known ground-truth circuit, but their weights and activations are similar to the ones in naturally trained transformers, letting us measure, in a realistic setting, how accurate circuit discovery methods are at finding the true circuit.
Figure 2: A histogram of the weights for the MLP output matrix in Layer 0 of a Tracr, SIIT, and "natural" transformer, i.e. one trained by gradient descent with supervised learning. All these transformers implement the frac_prevs task (Lindner et al., 2023). The weight distribution of the SIIT-trained transformer is much closer to that of the natural transformer than to that of the Tracr transformer. Yet, we know the ground-truth algorithm that the SIIT transformer implements. We provide the KL divergence between these histograms in Table \ref{tab:kl-div-weight-histograms}.
Figure 3: Example of a low-level model that has perfect accuracy and whose aligned low-level nodes (in yellow) are causally consistent with the high-level model, but whose non-aligned nodes (in grey) still affect the output.
Figure 4: Circuit for the Indirect Object Identification task in InterpBench. This circuit is a simplified version of the one manually discovered by Wang et al. (2023). The Duplicate token head outputs the first position of duplicated tokens, if there are any; otherwise it outputs $-1$. The S-Inhibition head copies the token from the previous position and outputs it to the Name mover head, which increases the logits of all names except the ones that are inhibited.
Figure 5: Average effect on accuracy for nodes in the circuit (green) and out of the circuit (red) for the models of $7$ randomly sampled tasks in the benchmark. Boxplots display, for each task and model, the average proportion of model outputs that change when intervening on nodes. For all regression tasks, we deem an intervention to have an effect when the new scalar output differs by $0.05$ or more from the original. We can see that for Tracr and SIIT models, nodes not in the circuit have much lower effects, but that is not the case for IIT models.
...and 13 more figures
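
The effect measure described in the Figure 5 caption (the proportion of model outputs that change under an intervention, with a $0.05$ threshold for regression tasks) can be sketched as follows. This is a minimal illustration and not the benchmark's code: the function name `effect_rate` and its parameters are assumptions, and how the intervened outputs are produced is left outside the sketch.

```python
from typing import Sequence


def effect_rate(
    original: Sequence,
    intervened: Sequence,
    regression: bool = False,
    threshold: float = 0.05,
) -> float:
    """Proportion of outputs that change when a node is intervened on.

    For regression tasks, an output counts as changed when the new scalar
    differs from the original by `threshold` or more. For all other tasks,
    an output counts as changed when the two values are not equal.
    """
    if len(original) != len(intervened):
        raise ValueError("original and intervened must have the same length")
    if len(original) == 0:
        raise ValueError("at least one output is required")

    if regression:
        changed = sum(
            abs(o - i) >= threshold for o, i in zip(original, intervened)
        )
    else:
        changed = sum(o != i for o, i in zip(original, intervened))
    return changed / len(original)


# Regression example: two of four outputs move by at least 0.05.
regression_effect = effect_rate(
    [0.0, 1.0, 2.0, 3.0], [0.01, 1.2, 2.0, 2.5], regression=True
)

# Classification example: one of four predicted labels changes.
classification_effect = effect_rate(
    ["a", "b", "c", "d"], ["a", "b", "c", "x"]
)
```

Averaging this quantity over the nodes inside the circuit and, separately, over the nodes outside it gives the two groups that the boxplots compare.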
