Discovering Variable Binding Circuitry with Desiderata

Xander Davies, Max Nadeau, Nikhil Prakash, Tamar Rott Shaham, David Bau

TL;DR

The paper proposes a desiderata-guided framework for causal interventions in large language models: it localizes a computation by specifying constraints on its causal behavior rather than by running exhaustive ablations. The method extends activation patching by learning a sparse mask over model components so that patching the masked components produces the effect pattern each desideratum specifies. As a proof of concept, it identifies shared variable-binding circuitry in LLaMA-13B consisting of 10 components (9 attention heads and 1 MLP) that copies variable values into the final token's residual stream. Masks trained only on addition and subtraction problems are also evaluated on multiplication problems. The results highlight the importance of jointly enforcing multiple desiderata to isolate the target circuitry and suggest a scalable path toward automated mechanistic localization in large models.

Abstract

Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits. Here, we introduce an approach which extends causal mediation experiments to automatically identify model components responsible for performing a specific subtask by solely specifying a set of desiderata, or causal attributes of the model components executing that subtask. As a proof of concept, we apply our method to automatically discover shared variable binding circuitry in LLaMA-13B, which retrieves variable values for multiple arithmetic tasks. Our method successfully localizes variable binding to only 9 attention heads (of the 1.6k) and one MLP in the final token's residual stream.

Paper Structure (17 sections, 2 equations, 5 figures, 1 table)


Figures (5)

  • Figure 1: Localizing computation with desiderata. The figure depicts training with a single (original, alternate, target) tuple within a desideratum. We learn a mask $w$ that combines activations from an alternate sequence $a$ into the computation of the model on the input of the original sequence $o$ such that the output $y$ moves towards the target $t$.
  • Figure 2: Variable Binding Desiderata. Each desideratum is a set of original ($o$), alternate ($a$), and target ($t$) 3-tuples. In the Value Dependence desideratum, patching should change the output to the alternate's output; in the Operation Invariance desideratum, patching should have no effect.
  • Figure 3: Evaluating masks of various numbers of heads on held-out VD and OI problems. Each vertical pair of datapoints corresponds to a mask learned by a training run with a different value of $\lambda$, the sparsity regularization weight. With too few components patched, the model does not score well at Value Dependence. We interpret this as indicating that not enough of the value-copying heads have been patched.
  • Figure 4: Transfer to accuracy on multiplication problems. This graph depicts the same masks as Fig. 3 (which were trained on sequences involving only addition and subtraction), but evaluated on all-multiplication Value Dependence problems, and addition-to-multiplication (and vice versa) Operation Invariance problems. As in Fig. 3, VD accuracy is low with too few heads patched.
  • Figure 5: Varying regularization strength with incomplete desiderata. This graph demonstrates learning a mask with only the Value Dependence desideratum. Again, each vertical pair of datapoints corresponds to a mask learned by a training run with a different value of $\lambda$, the sparsity regularization weight. Unlike when the mask is optimized according to both desiderata, these masks fail to achieve high accuracies on both Operation Invariance and Value Dependence at the same time, as discussed in the paper's section on variable binding.
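
The mechanism described in the captions of Figures 1 to 3 can be illustrated with a minimal toy sketch. It is not the authors' implementation and does not involve LLaMA-13B. Everything below is an assumption made for illustration: the "model" is a fixed linear readout over three hypothetical components (a value carrier, an operation carrier, and a distractor), the loss is squared error plus an L1-style penalty weighted by `lam` (standing in for $\lambda$), and the names `patched`, `make_data`, and `train_mask` are invented here. The sketch only shows how a mask $w$ mixes alternate activations $a$ into the original run $o$ so that the output $y$ moves toward the target $t$, and how the two desiderata pull the mask in different directions.

```python
import random

def patched(w, o, a):
    """Per component, mix the alternate activation into the original run."""
    return [wi * ai + (1.0 - wi) * oi for wi, oi, ai in zip(w, o, a)]

def output(h, readout):
    """Toy linear 'model': the output is a weighted sum of component activations."""
    return sum(r * x for r, x in zip(readout, h))

def make_data(n=200, seed=0):
    """Build (original, alternate, target) tuples for both toy desiderata.

    Activations are [value, operation, distractor]; this layout is hypothetical.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        v, v_alt = rng.sample(range(4), 2)       # two distinct variable values
        s = rng.choice([-1.0, 1.0])              # stand-in for the operation
        d, d_alt = rng.random(), rng.random()    # irrelevant activity
        # Value Dependence: patching should yield the alternate's output.
        data.append(([v, s, d], [v_alt, s, d_alt], v_alt + s))
        # Operation Invariance: patching should leave the original output alone.
        data.append(([v, s, d], [v, -s, d_alt], v + s))
    return data

def train_mask(data, readout, lam=0.05, lr=0.02, steps=2000, seed=0):
    """Stochastic gradient descent on (y - t)^2 + lam * sum(w), with w kept in [0, 1]."""
    rng = random.Random(seed)
    w = [0.5] * len(readout)
    for _ in range(steps):
        o, a, t = rng.choice(data)
        y = output(patched(w, o, a), readout)
        for i in range(len(w)):
            # d/dw_i of the squared error, plus the sparsity penalty gradient.
            grad = 2.0 * (y - t) * readout[i] * (a[i] - o[i]) + lam
            w[i] = min(1.0, max(0.0, w[i] - lr * grad))
    return w

data = make_data()
w = train_mask(data, readout=[1.0, 1.0, 0.0])
print([round(x, 3) for x in w])
```

In this toy setting the Value Dependence tuples push the value component's weight toward 1, the Operation Invariance tuples push the operation component's weight toward 0, and the sparsity penalty removes the distractor, which neither desideratum needs. Raising `lam` trades mask size against Value Dependence accuracy, loosely mirroring the sweep over $\lambda$ described for Figure 3.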