Transformer Circuit Faithfulness Metrics are not Robust

Joseph Miller, Bilal Chughtai, William Saunders

TL;DR

This paper shows that transformer circuit faithfulness metrics are not robust to modest changes in ablation methodology. It surveys a range of ablation design choices (granularity, component type, ablation value, token positions, ablation direction, and whether the circuit or its complement is ablated) and tests how these choices affect faithfulness across circuits and tasks. The authors show that faithfulness scores can vary dramatically with methodology, and that which circuit counts as optimal depends on the prompts and evaluation setup rather than on any universal circuit structure. They conclude with practical recommendations for clearer, more replicable claims about circuits and release an open-source library, AutoCircuit, to support standardized, efficient ablation-based circuit analysis.

Abstract

Mechanistic interpretability work attempts to reverse engineer the learned algorithms present inside neural networks. One focus of this work has been to discover 'circuits' -- subgraphs of the full model that explain behaviour on specific tasks. But how do we measure the performance of such circuits? Prior work has attempted to measure circuit 'faithfulness' -- the degree to which the circuit replicates the performance of the full model. In this work, we survey many considerations for designing experiments that measure circuit faithfulness by ablating portions of the model's computation. Concerningly, we find existing methods are highly sensitive to seemingly insignificant changes in the ablation methodology. We conclude that existing circuit faithfulness scores reflect both the methodological choices of researchers as well as the actual components of the circuit - the task a circuit is required to perform depends on the ablation used to test it. The ultimate goal of mechanistic interpretability work is to understand neural networks, so we emphasize the need for more clarity in the precise claims being made about circuits. We open source a library at https://github.com/UFO-101/auto-circuit that includes highly efficient implementations of a wide range of ablation methodologies and circuit discovery algorithms.
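
As a rough illustration of why the ablation choice matters, the sketch below scores one fixed circuit under two ablation values. Everything in it is an assumption made for illustration: the additive toy "model", the component names, the inputs, and the ratio used as a faithfulness score. It is not the paper's experimental setup and not the AutoCircuit API.

```python
from statistics import mean

# Toy "model": the output is the sum of per-component activations.
# Components, inputs, and the circuit are hypothetical.
def activations(x):
    return {"head_a": 2.0 * x, "head_b": 1.0 * x, "mlp": 0.5}

def run(x, circuit, ablation_values):
    """Run the toy model, replacing every component outside `circuit`."""
    acts = activations(x)
    return sum(
        acts[name] if name in circuit else ablation_values[name]
        for name in acts
    )

def faithfulness(x, circuit, ablation_values):
    """Circuit output divided by full-model output (assumed score definition)."""
    full = run(x, set(activations(x)), ablation_values)
    return run(x, circuit, ablation_values) / full

circuit = {"head_a"}
reference_inputs = [1.0, 2.0, 3.0]

# Mean ablation: replace with the average activation over a reference dataset.
mean_values = {
    name: mean(activations(r)[name] for r in reference_inputs)
    for name in activations(0.0)
}
# Resample ablation: replace with the activations from one other input.
resample_values = activations(1.0)

score_mean = faithfulness(4.0, circuit, mean_values)
score_resample = faithfulness(4.0, circuit, resample_values)
```

The circuit and the input are identical in both calls, yet the two scores differ, because the ablated components contribute different values to the output.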

Paper Structure

This paper contains 25 sections, 5 equations, 12 figures, and 4 tables.

Figures (12)

  • Figure 1: The factorized and 'treeified' formulations of transformers suggest more specific ablations than ablating whole nodes.
  • Figure 2: Two approaches to testing a circuit that both measure faithfulness as the similarity of the output to the full model.
  • Figure 3: The IOI faithfulness metric is sensitive to (1) whether edges or nodes are ablated, (2) the type of ablation used: we test Resample Ablations and Mean Ablations (over a dataset of 100 ABC prompts, which differs from Wang et al. (2022)), and (3) whether we distinguish between token positions in the circuit. The original IOI work evaluated at specific token positions with Mean Node Ablations and obtained a logit difference recovery of 87%. Other methodologies giving faithfulness scores above 100% or below 0% would have given the authors significantly less confidence about the IOI circuit, and may have led them to include different edges.
  • Figure 4: (Left) The IOI circuit is sensitive to the size of the ABC dataset used for mean ablation. The logit difference recovered is consistently higher for prompts of the BABA format. (Left and Middle Left) The order of computing the average and percentage affects the faithfulness metric. Wang et al. (2022) use [Average Logit Diff] %, giving lower scores than Average [Logit Diff %]. (Middle Right and Right) The range of logit difference recovered is large; the boxplots show the interquartile range. According to this faithfulness measurement methodology, the IOI circuit implements the IOI task faithfully on average, but not for many single data points.
  • Figure 5: ROC Curves measuring the overlap between automatically discovered circuits and the two different "ground truth" circuits, for two Tracr tasks. When we match the ablation methodology of the ground truth with the ablation methodology of the circuit discovery algorithms, we can achieve perfect circuit recovery with all three methods.
  • ...and 7 more figures
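
The Figure 4 caption notes that the order of averaging and taking a percentage changes the score. A minimal sketch of that effect follows, assuming that logit difference recovered is the circuit's logit difference divided by the full model's. The per-prompt numbers and function names are hypothetical, not data or code from the paper.

```python
from statistics import mean

# Hypothetical per-prompt logit differences (illustrative numbers only).
model_logit_diff = [4.0, 2.0, 1.0]
circuit_logit_diff = [3.0, 2.0, 2.0]

def pct_of_average(circuit, model):
    """[Average Logit Diff] %: average over prompts first, then take the ratio."""
    return 100.0 * (mean(circuit) / mean(model))

def average_of_pct(circuit, model):
    """Average [Logit Diff %]: per-prompt ratio first, then average over prompts."""
    return mean(100.0 * c / m for c, m in zip(circuit, model))

ratio_of_means = pct_of_average(circuit_logit_diff, model_logit_diff)
mean_of_ratios = average_of_pct(circuit_logit_diff, model_logit_diff)
```

With these toy numbers the first ordering gives 100% and the second gives 125%. The per-prompt percentages (75%, 100%, 200%) also show how an average can look faithful while individual prompts are far from it.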