Position-aware Automatic Circuit Discovery

Tal Haklay, Hadas Orgad, David Bau, Aaron Mueller, Yonatan Belinkov

TL;DR

This work addresses the inadequacy of position-agnostic circuit discovery by introducing PEAP, a method that assigns edge importance per token position and accounts for cross-position attention edges. It also introduces dataset schemas to handle variable-length inputs, enabling the construction of abstract circuits that can be faithfully mapped back to concrete computation graphs across examples. The authors develop an automated, LLM-driven pipeline for generating and applying schemas, including saliency-informed prompting and multiple validation checks, achieving faithfulness comparable to manually designed schemas across several tasks and models. Empirically, position-aware circuits reach similar or better faithfulness at smaller circuit sizes than non-positional circuits, which improves interpretability and scalability for mechanistic analysis of transformers. Overall, the approach advances interpretable model analysis by capturing position-specific mechanisms and enabling automated circuit discovery on variable-length data.

Abstract

A widely used strategy to discover and understand language model mechanisms is circuit analysis. A circuit is a minimal subgraph of a model's computation graph that executes a specific task. We identify a gap in existing circuit discovery methods: they assume circuits are position-invariant, treating model components as equally relevant across input positions. This limits their ability to capture cross-positional interactions or mechanisms that vary across positions. To address this gap, we propose two improvements to incorporate positionality into circuits, even on tasks containing variable-length examples. First, we extend edge attribution patching, a gradient-based method for circuit discovery, to differentiate between token positions. Second, we introduce the concept of a dataset schema, which defines token spans with similar semantics across examples, enabling position-aware circuit discovery in datasets with variable-length examples. We additionally develop an automated pipeline for schema generation and application using large language models. Our approach enables fully automated discovery of position-sensitive circuits, yielding better trade-offs between circuit size and faithfulness compared to prior work.
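
To make the dataset-schema idea concrete, here is a minimal sketch of how a schema might align variable-length examples. It is an illustration, not the authors' implementation. The span names in `SCHEMA`, the helpers `apply_schema` and `concretize`, the component name `attn_head`, and the rule that spans must cover every token exactly once are all assumptions made for this example.

```python
# Hypothetical schema: an ordered list of span names shared by all examples.
SCHEMA = ["subject", "verb", "object", "final"]


def apply_schema(tokens, span_lengths):
    """Assign consecutive token positions to the schema's spans.

    span_lengths[i] is the number of tokens that span SCHEMA[i] covers
    in this example. Returns a dict: span name -> list of token positions.
    """
    if len(span_lengths) != len(SCHEMA):
        raise ValueError("one length per schema span is required")
    if sum(span_lengths) != len(tokens):
        # Assumed validity rule: spans cover every token exactly once.
        raise ValueError("spans must cover every token exactly once")
    mapping, start = {}, 0
    for name, length in zip(SCHEMA, span_lengths):
        mapping[name] = list(range(start, start + length))
        start += length
    return mapping


def concretize(abstract_node, mapping):
    """Map an abstract node (component, span) to concrete (component, position) nodes."""
    component, span = abstract_node
    return [(component, pos) for pos in mapping[span]]


# Two examples of different lengths share the same four spans.
short_tokens = ["Mary", "saw", "John", "."]
long_tokens = ["The", "tall", "doctor", "quickly", "saw", "John", "."]

short_map = apply_schema(short_tokens, [1, 1, 1, 1])
long_map = apply_schema(long_tokens, [3, 2, 1, 1])

# The same abstract node expands to a different number of concrete nodes.
short_nodes = concretize(("attn_head", "subject"), short_map)
long_nodes = concretize(("attn_head", "subject"), long_map)
```

In this sketch the abstract node `("attn_head", "subject")` covers one position in the short example and three in the long one. That is the sense in which a circuit defined over spans can be mapped back to examples of different lengths.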

Paper Structure

This paper contains 39 sections, 8 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Positional vs. non-positional circuits. In a non-positional circuit, the same edges must be included at all positions. A positional circuit can distinguish between the same edge at different positions. This specificity yields better trade-offs between circuit size and faithfulness. It can also increase both precision and recall.
  • Figure 2: Left: The yellow edge at position 1 has the highest score of 100, indicating it is the most important edge. However, aggregating across positions causes scores of opposite signs to cancel. This causes the yellow edge to be incorrectly ranked as the least important. Right: The yellow edge at position 1 has the highest score; the scores of other edges are consistently high (but lower) at many positions. After summing across positions, the non-yellow edges have higher scores. Thus, the yellow edge is incorrectly ranked as the least important.
  • Figure 3: Illustration of the attention mechanism from the perspective of position 3. We approximate how patching $v_1$, $k_1$ or $q_3$ impacts the downstream metric via the output of the attention head at position 3.
  • Figure 4: Example schema for each task. We show examples from the LLM+Mask method. See the task details appendix for examples of human-designed schemas.
  • Figure 5: Circuits defined over schemas. Every node/edge at position $s$ in the abstract computation graph is mapped to a set of nodes/edges in the full computation graph within the span $s$.
  • ...and 6 more figures
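
The cancellation effect described in the left panel of Figure 2 can be reproduced with a small toy calculation. The sketch below uses invented scores (only the value 100 at position 1 comes from the caption) and assumes that edges are ranked by absolute score. The edge names and the two ranking helpers are hypothetical.

```python
# Toy per-position edge scores. All numbers are made up for illustration,
# except that the "yellow" edge scores 100 at position 1 as in the caption.
# Each list holds the scores at token positions 1, 2 and 3.
scores = {
    "yellow": [100.0, -60.0, -40.0],
    "blue": [10.0, 10.0, 10.0],
    "green": [5.0, 5.0, 5.0],
}


def rank_non_positional(scores):
    """Sum each edge's scores over positions, then rank edges by |sum|."""
    totals = {edge: sum(vals) for edge, vals in scores.items()}
    return sorted(totals, key=lambda edge: abs(totals[edge]), reverse=True)


def rank_positional(scores):
    """Rank (edge, position) pairs by |score|, keeping positions separate."""
    pairs = [
        ((edge, pos + 1), val)
        for edge, vals in scores.items()
        for pos, val in enumerate(vals)
    ]
    pairs.sort(key=lambda item: abs(item[1]), reverse=True)
    return [key for key, _ in pairs]


non_positional = rank_non_positional(scores)
positional = rank_positional(scores)
```

With these numbers the yellow edge's scores sum to zero, so the aggregated ranking places it last, while the per-position ranking places `("yellow", 1)` first.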