Pitfalls in Evaluating Interpretability Agents

Tal Haklay, Nikhil Prakash, Sana Pandey, Antonio Torralba, Aaron Mueller, Jacob Andreas, Tamar Rott Shaham, Yonatan Belinkov

Abstract

Automated interpretability systems aim to reduce the need for human labor and scale analysis to increasingly large models and diverse tasks. Recent efforts toward this goal leverage large language models (LLMs) at increasing levels of autonomy, ranging from fixed one-shot workflows to fully autonomous interpretability agents. This shift creates a corresponding need to scale evaluation approaches to keep pace with both the volume and complexity of generated explanations. We investigate this challenge in the context of automated circuit analysis -- explaining the roles of model components when performing specific tasks. To this end, we build an agentic system in which a research agent iteratively designs experiments and refines hypotheses. When evaluated against human expert explanations across six circuit analysis tasks in the literature, the system appears competitive. However, closer examination reveals several pitfalls of replication-based evaluation: human expert explanations can be subjective or incomplete, outcome-based comparisons obscure the research process, and LLM-based systems may reproduce published findings via memorization or informed guessing. To address some of these pitfalls, we propose an unsupervised intrinsic evaluation based on the functional interchangeability of model components. Our work demonstrates fundamental challenges in evaluating complex automated interpretability systems and reveals key limitations of replication-based evaluation.

Paper Structure (40 sections, 4 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Top: Overview of the system workflow. (A) A researcher specifies a task and a circuit to analyze. (B) A researcher agent iteratively analyzes each component independently, autonomously designing and running experiments as needed. (C) Claude clusters components based on shared functionality inferred from the generated hypotheses. Bottom: Example of tool calls and the corresponding results returned by the tools.
  • Figure 2: The judge workflow. The system produces hypotheses explaining the functionality of individual components and clusters. A judge model is tasked with matching the hypothesis to one of the descriptions reported by the researchers in the original paper.
  • Figure 3: Performance comparison across six circuit analysis tasks. Left: Component Functionality Accuracy measures how well individual component explanations match human-labeled clusters. Middle: Cluster Functionality Accuracy measures how well cluster explanations match human-labeled clusters. Right: Component Assignment Accuracy assesses cluster alignment with expert-defined clusters using optimal matching. While the systems obtain relatively high results, they usually do not perfectly match expert explanations.
  • Figure 4: Examples of the agent’s experimental designs. Blue: an IOI task example from the initial prompt set provided to the agent. Orange: the agent’s motivation for the experiment. Green: new example prompts proposed by the agent to test its hypotheses.
  • Figure 5: Top: Example of indirect memorization in a final hypothesis produced by the one-shot system for attention head 9.9 in the IOI task. The system explicitly uses the term "name mover head", reproducing the terminology introduced in the original IOI paper. For more details on "name mover" heads, see App. \ref{ap:task-ioi}. Bottom: Example of direct recall of the IOI circuit by Claude when it is explicitly asked to recall the circuit from memory. For more details on the prompt used and the full response, see App. \ref{ap:ioi_memory}.
  • ...and 3 more figures
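
The Figure 3 caption describes Component Assignment Accuracy as cluster alignment with expert-defined clusters "using optimal matching". This excerpt does not give the exact definition, so the following is only a minimal sketch of one common reading: find the one-to-one relabeling of predicted clusters that maximizes agreement with the expert clusters, then report the fraction of components that agree. The function name, the dictionary inputs, and the toy labels are all hypothetical, and the brute-force search over permutations is an illustrative stand-in that is only practical for a handful of clusters.

```python
from itertools import permutations


def component_assignment_accuracy(predicted, expert):
    """Best-case agreement between two clusterings of the same components.

    `predicted` and `expert` map component name -> cluster label (assumed
    to be strings). Hypothetical helper; not the paper's implementation.
    """
    components = sorted(expert)
    if not components:
        raise ValueError("need at least one component")
    pred_labels = sorted({predicted[c] for c in components})
    expert_labels = sorted({expert[c] for c in components})

    # Pad with None so a predicted cluster may be matched to "no expert
    # cluster" when there are more predicted clusters than expert ones.
    size = max(len(pred_labels), len(expert_labels))
    candidates = expert_labels + [None] * (size - len(expert_labels))

    best_hits = 0
    # Try every one-to-one assignment of predicted labels to expert labels.
    for perm in permutations(candidates, len(pred_labels)):
        mapping = dict(zip(pred_labels, perm))
        hits = sum(mapping[predicted[c]] == expert[c] for c in components)
        best_hits = max(best_hits, hits)
    return best_hits / len(components)


# Toy data (made up for illustration, not taken from the paper).
predicted = {"c1": "A", "c2": "A", "c3": "A", "c4": "B", "c5": "B"}
expert = {"c1": "X", "c2": "X", "c3": "X", "c4": "Y", "c5": "X"}
print(component_assignment_accuracy(predicted, expert))
```

In this toy case the best relabeling maps A to X and B to Y, so four of the five components agree. For larger numbers of clusters, an assignment solver such as the Hungarian algorithm would be the usual replacement for the permutation loop.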