InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
Rohan Gupta, Iván Arcuschin, Thomas Kwa, Adrià Garriga-Alonso
TL;DR
InterpBench introduces 86 semi-synthetic transformers with ground-truth circuits for validating mechanistic interpretability (MI) methods. It extends Interchange Intervention Training (IIT) to Strict IIT (SIIT), which additionally trains low-level components that are not aligned with the high-level causal model to have no effect on the output, so the resulting models implement the known circuits faithfully. Realism analyses show that SIIT transformers resemble naturally trained models in weight distributions and behavior, which makes them a meaningful testbed for circuit-discovery techniques. On this benchmark, ACDC outperforms several baselines, while EAP-IG (Edge Attribution Patching with integrated gradients) remains competitive.
Abstract
Mechanistic interpretability methods aim to identify the algorithm a neural network implements, but it is difficult to validate such methods when the true algorithm is unknown. This work presents InterpBench, a collection of semi-synthetic yet realistic transformers with known circuits for evaluating these techniques. We train simple neural networks using a stricter version of Interchange Intervention Training (IIT) which we call Strict IIT (SIIT). Like the original, SIIT trains neural networks by aligning their internal computation with a desired high-level causal model, but it also prevents non-circuit nodes from affecting the model's output. We evaluate SIIT on sparse transformers produced by the Tracr tool and find that SIIT models maintain Tracr's original circuit while being more realistic. SIIT can also train transformers with larger circuits, like Indirect Object Identification (IOI). Finally, we use our benchmark to evaluate existing circuit discovery techniques.
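To make the distinction between IIT and SIIT concrete, the following is a minimal toy sketch, not the authors' implementation. Everything in it is a hypothetical illustration: the high-level model `s = a + b, y = s * c`, the hand-written two-node "network", the `leak` parameter, and the use of squared error as the loss. It assumes that "preventing non-circuit nodes from affecting the output" can be checked by patching a non-circuit node with its value from another input and requiring the output to stay correct.

```python
# Toy illustration only: all names and models here are hypothetical.

def high_level(x, s_override=None):
    """High-level causal model: s = a + b, then y = s * c."""
    a, b, c = x
    s = a + b if s_override is None else s_override
    return s * c, s

def low_level(x, leak=0.0, patch=None):
    """Hand-written stand-in for a network with two hidden nodes.

    h0 is aligned with the high-level variable s.
    h1 is a non-circuit node. With leak != 0 the output depends on h1,
    but the dependence cancels whenever h1 holds its own clean value.
    """
    a, b, c = x
    hidden = {"h0": a + b, "h1": a - b}
    if patch:
        hidden.update(patch)  # overwrite selected activations
    # The output also reads the raw inputs, like a residual-style skip path.
    y = hidden["h0"] * c + leak * (hidden["h1"] - (a - b))
    return y, hidden

def iit_loss(base, source, leak):
    """Interchange intervention on the aligned node h0 (IIT-style check)."""
    _, src_hidden = low_level(source, leak=leak)
    y_low, _ = low_level(base, leak=leak, patch={"h0": src_hidden["h0"]})
    _, s_src = high_level(source)
    y_high, _ = high_level(base, s_override=s_src)
    return (y_low - y_high) ** 2  # squared error is an assumption

def strict_loss(base, source, leak):
    """Patch the non-circuit node h1 from another input (SIIT-style check).

    The high-level model has no counterpart for h1, so the target is the
    unintervened high-level output on the base input.
    """
    _, src_hidden = low_level(source, leak=leak)
    y_low, _ = low_level(base, leak=leak, patch={"h1": src_hidden["h1"]})
    y_high, _ = high_level(base)
    return (y_low - y_high) ** 2

base, source = (1, 2, 3), (4, 1, 2)
print("IIT loss, leaky model:   ", iit_loss(base, source, leak=0.5))
print("Strict loss, leaky model:", strict_loss(base, source, leak=0.5))
print("Strict loss, clean model:", strict_loss(base, source, leak=0.0))
```

In this toy, the leaky model passes the interchange-intervention check on the aligned node, and only the intervention on the non-circuit node exposes its influence on the output. That is the kind of failure the stricter objective is meant to penalize.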
