Discovering Variable Binding Circuitry with Desiderata
Xander Davies, Max Nadeau, Nikhil Prakash, Tamar Rott Shaham, David Bau
TL;DR
This paper proposes a desiderata-guided framework for causal interventions in large language models: rather than running exhaustive ablations, it localizes a computation by specifying constraints (desiderata) on how the responsible components should behave under intervention. The method extends activation patching by learning a sparse mask over model components such that patching the masked components produces the effect pattern each desideratum specifies. In a proof of concept, it identifies shared variable-binding circuitry of 10 components (9 attention heads and 1 MLP) that copies variable values into the final token's residual stream and generalizes beyond the operations used during training. The results show that jointly enforcing multiple desiderata is important for isolating the target circuitry, and they suggest a scalable path toward automated mechanistic localization in large models.
Abstract
Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits. Here, we introduce an approach that extends causal mediation experiments to automatically identify the model components responsible for a specific subtask solely by specifying a set of \textit{desiderata}, or causal attributes of the model components executing that subtask. As a proof of concept, we apply our method to automatically discover shared \textit{variable binding circuitry} in LLaMA-13B, which retrieves variable values for multiple arithmetic tasks. Our method successfully localizes variable binding to only 9 attention heads (of the model's 1.6k) and one MLP in the final token's residual stream.
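To make the idea of localizing components from desiderata more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a hypothetical four-component "model" whose output is simply the sum of its component activations, a mask that interpolates each component between its original and counterfactual activation, and a loss made of squared desiderata error plus an L1 sparsity penalty. The names `patched_output`, `learn_mask`, `VALUE`, and `OPERATION`, as well as all numbers, are illustrative assumptions.

```python
# Toy sketch of desiderata-guided mask learning (illustrative assumptions only).
# Assumption: the "model" output is the sum of its component activations.

def patched_output(orig, counterfactual, mask):
    """Output when component i is interpolated toward its counterfactual by mask[i]."""
    return sum(o + m * (c - o) for o, c, m in zip(orig, counterfactual, mask))


def learn_mask(desiderata, n_components, sparsity=0.1, lr=0.01, steps=2000):
    """Projected gradient descent on squared desiderata error plus an L1 penalty."""
    mask = [0.5] * n_components
    for _ in range(steps):
        # Gradient of the L1 penalty (mask entries are kept non-negative).
        grad = [sparsity] * n_components
        for orig, cf, target in desiderata:
            residual = patched_output(orig, cf, mask) - target
            for i in range(n_components):
                grad[i] += 2.0 * residual * (cf[i] - orig[i])
        # Gradient step, then project each entry back into [0, 1].
        mask = [min(1.0, max(0.0, m - lr * g)) for m, g in zip(mask, grad)]
    return mask


# Hypothetical activations of four components on the original input (output 11).
ORIG = [2.0, 3.0, 5.0, 1.0]

# Value desideratum: patching from a run with a different variable value
# should move the output to the counterfactual answer (16). Components 0 and 1
# both change in this counterfactual, so either one alone could explain it.
VALUE = (ORIG, [7.0, 8.0, 5.0, 1.0], 16.0)

# Operation desideratum: patching from a run with a different operation
# should leave the output unchanged (11). Only component 1 changes here.
OPERATION = (ORIG, [2.0, 7.0, 5.0, 1.0], 11.0)

joint_mask = learn_mask([VALUE, OPERATION], n_components=4)
value_only_mask = learn_mask([VALUE], n_components=4)

print("joint desiderata:", [round(m, 3) for m in joint_mask])
print("value only:      ", [round(m, 3) for m in value_only_mask])
```

In this toy setup, the mask trained on both desiderata concentrates on component 0, whereas the mask trained on the value desideratum alone splits its weight evenly between components 0 and 1. This mirrors, in miniature, why enforcing several desiderata together can be needed to isolate the target components.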
