Transformer Circuit Faithfulness Metrics are not Robust

Joseph Miller; Bilal Chughtai; William Saunders

TL;DR

This paper shows that transformer circuit faithfulness metrics are not robust to modest variations in ablation methodology. It surveys a wide range of ablation design choices (granularity, component type, ablation value, token positions, ablation direction, and whether the circuit or its complement is ablated) and tests how these choices affect faithfulness across circuits and tasks. The authors show that faithfulness scores can vary dramatically with methodology, and that the optimal circuit is defined by the prompts and the evaluation setup rather than by a universal circuit structure. They conclude with practical recommendations for clearer, more replicable claims about circuits and release AutoCircuit, an open-source library for standardized, efficient ablation-based circuit analysis.

Abstract

Mechanistic interpretability work attempts to reverse engineer the learned algorithms present inside neural networks. One focus of this work has been to discover 'circuits' -- subgraphs of the full model that explain behaviour on specific tasks. But how do we measure the performance of such circuits? Prior work has attempted to measure circuit 'faithfulness' -- the degree to which the circuit replicates the performance of the full model. In this work, we survey many considerations for designing experiments that measure circuit faithfulness by ablating portions of the model's computation. Concerningly, we find existing methods are highly sensitive to seemingly insignificant changes in the ablation methodology. We conclude that existing circuit faithfulness scores reflect both the methodological choices of researchers and the actual components of the circuit -- the task a circuit is required to perform depends on the ablation used to test it. The ultimate goal of mechanistic interpretability work is to understand neural networks, so we emphasize the need for more clarity in the precise claims being made about circuits. We open source a library at https://github.com/UFO-101/auto-circuit that includes highly efficient implementations of a wide range of ablation methodologies and circuit discovery algorithms.
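To make the idea of measuring faithfulness by ablation concrete, the following is a minimal sketch on a toy model, not the paper's method and not the AutoCircuit API. Everything in it is an assumption made for illustration: the four-component "model", the weight matrix `W`, the function names, and the choice of three ablation values (zero, mean over a reference dataset, and resampling from a reference dataset).

```python
import numpy as np

# Hypothetical toy model: four components each write one scalar, and the
# "logit difference" is the sum of those scalars. Not a real transformer.
W = np.random.default_rng(0).normal(size=(4, 3))


def component_outputs(x):
    """Map inputs of shape (batch, 3) to component outputs of shape (batch, 4)."""
    return x @ W.T


def logit_diff(acts):
    """Toy readout: sum the component outputs for each prompt."""
    return acts.sum(axis=1)


def run_with_ablation(x_clean, x_ref, circuit_mask, ablation, rng=None):
    """Keep components inside the circuit; replace the rest with an ablation value."""
    clean = component_outputs(x_clean)
    ref = component_outputs(x_ref)
    if ablation == "zero":
        replacement = np.zeros_like(clean)
    elif ablation == "mean":
        # One mean activation per component, shared by every prompt.
        replacement = np.broadcast_to(ref.mean(axis=0), clean.shape)
    elif ablation == "resample":
        # One randomly drawn reference prompt per clean prompt.
        rng = np.random.default_rng(0) if rng is None else rng
        replacement = ref[rng.integers(0, len(ref), size=len(clean))]
    else:
        raise ValueError(f"unknown ablation: {ablation}")
    # circuit_mask has shape (4,) and broadcasts across the batch dimension.
    patched = np.where(circuit_mask, clean, replacement)
    return logit_diff(patched)


# Usage: the same hypothetical circuit scored under three ablation values.
demo_rng = np.random.default_rng(1)
x_clean = demo_rng.normal(size=(8, 3))
x_ref = demo_rng.normal(size=(16, 3))
circuit = np.array([True, True, False, False])
for kind in ("zero", "mean", "resample"):
    out = run_with_ablation(x_clean, x_ref, circuit, kind, demo_rng)
    print(kind, float(out.mean()))
```

Even in this toy setting, the output attributed to the same circuit changes with the ablation value, which is the kind of methodological dependence the abstract describes.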

Paper Structure (25 sections, 5 equations, 12 figures, 4 tables)

Introduction
Related Work
Measuring Faithfulness
Ablation Methodology
Circuit Granularity
Ablation Component Type (and Associated Model Views)
Ablation Value
Token Positions
Ablation Direction and Testing Circuits
Metric
Faithfulness Metrics are Sensitive to Ablation Methodology
Variance Between Ablation Methodologies
Variance Between Individual Datapoints
Optimal Circuits Are Defined By Prompts and Ablation Methodologies
Conclusion
...and 10 more sections

Figures (12)

Figure 1: The factorized and 'treeified' formulations of transformers suggest more specific ablations than ablating whole nodes.
Figure 2: Two approaches to testing a circuit that both measure faithfulness as the similarity of the output to the full model.
Figure 3: The IOI faithfulness metric is sensitive to (1) ablating edges/nodes, (2) the type of ablation used -- we test Resample Ablations and Mean Ablations (over a dataset of 100 ABC prompts, which differs from Wang et al. (2022)) and (3) whether we distinguish between token positions in the circuit. The original IOI work evaluated at specific token positions with Mean Node Ablations and obtained a logit difference recovery of 87%. Other methodologies giving faithfulness scores above 100% or below 0% would have given the authors significantly less confidence about the IOI circuit, and may have led them to include different edges.
Figure 4: (Left) The IOI circuit is sensitive to the size of the ABC dataset used for mean ablation. The logit difference recovered is consistently higher for prompts of the BABA format. (Left and Middle Left) The order of computing the average and the percentage affects the faithfulness metric. Wang et al. (2022) use [Average Logit Diff] %, which gives lower scores than Average [Logit Diff %]. (Middle Right and Right) There is a large range of logit difference recovered; the boxplots show the interquartile range. According to this faithfulness measurement methodology, the IOI circuit implements the IOI task faithfully on average, but not for many individual data points.
Figure 5: ROC Curves measuring the overlap between automatically discovered circuits and the two different "ground truth" circuits, for two Tracr tasks. When we match the ablation methodology of the ground truth with the ablation methodology of the circuit discovery algorithms, we can achieve perfect circuit recovery with all three methods.
...and 7 more figures
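
The averaging-order effect mentioned in the Figure 4 caption can be illustrated with a small sketch. The function names and the two-prompt numbers below are made up for illustration; they are not taken from the paper or from the AutoCircuit library.

```python
import numpy as np


def average_then_percent(circuit_ld, model_ld):
    """[Average Logit Diff] %: ratio of the two dataset-level means."""
    return 100.0 * np.mean(circuit_ld) / np.mean(model_ld)


def percent_then_average(circuit_ld, model_ld):
    """Average [Logit Diff %]: mean of the per-prompt recovery ratios."""
    ratios = np.asarray(circuit_ld) / np.asarray(model_ld)
    return 100.0 * np.mean(ratios)


# Hypothetical logit differences for two prompts (illustrative values only).
model_ld = np.array([1.0, 4.0])    # full model
circuit_ld = np.array([1.0, 2.0])  # circuit with everything else ablated

print(average_then_percent(circuit_ld, model_ld))
print(percent_then_average(circuit_ld, model_ld))
```

With these made-up values the first ordering gives 60% and the second gives 75%: prompts with a large full-model logit difference dominate the ratio of means, whereas each prompt counts equally in the mean of ratios. The two orderings agree only in special cases, for example when every prompt has the same full-model logit difference.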
