Hypothesis Testing the Circuit Hypothesis in LLMs

Claudia Shi, Nicolas Beltran-Velez, Achille Nazaret, Carolina Zheng, Adrià Garriga-Alonso, Andrew Jesson, Maggie Makar, David M. Blei

TL;DR

A set of criteria that a circuit is hypothesized to meet is formalized, and a suite of hypothesis tests is developed to evaluate how well circuits satisfy them. Synthetic circuits -- circuits that are hard-coded in the model -- are found to align with the idealized properties.

Abstract

Large language models (LLMs) demonstrate surprising capabilities, but we do not understand how they are implemented. One hypothesis suggests that these capabilities are primarily executed by small subnetworks within the LLM, known as circuits. But how can we evaluate this hypothesis? In this paper, we formalize a set of criteria that a circuit is hypothesized to meet and develop a suite of hypothesis tests to evaluate how well circuits satisfy them. The criteria focus on the extent to which the LLM's behavior is preserved, the degree of localization of this behavior, and whether the circuit is minimal. We apply these tests to six circuits described in the research literature. We find that synthetic circuits -- circuits that are hard-coded in the model -- align with the idealized properties. Circuits discovered in Transformer models satisfy the criteria to varying degrees. To facilitate future empirical studies of circuits, we created the circuitry package, a wrapper around the TransformerLens library, which abstracts away lower-level manipulations of hooks and activations. The software is available at https://github.com/blei-lab/circuitry.

Paper Structure

This paper contains 34 sections, 15 equations, 5 figures, 4 tables, 2 algorithms.

Figures (5)

  • Figure 1: Simplified computational graph of a two-layer LLM with two attention heads (without MLPs). Nodes in each layer connect to all nodes in the next layer via residual connections. A highlighted arbitrary circuit is shown in blue. In a detailed graph, each incoming edge to an attention head splits into three: query, key, and value.
  • Figure 2: Left: The relative faithfulness of the candidate circuit compared to a random circuit from the reference distribution of varying sizes (x-axis). Dotted vertical lines indicate the actual size of the circuits. Right: The probability that a random circuit contains the canonical circuit.
  • Figure 3: The faithfulness of the circuit as we gradually knock down more edges from the canonical circuit. Edges are removed in order of their minimality score, starting with the least minimal. The dotted line shows the canonical circuit's faithfulness, and the solid line shows an empty circuit's faithfulness. Removing a few minimal edges does not significantly affect faithfulness.
  • Figure 4: Example of one step of the minimality test for the Docstring task: knocking out a single edge of the candidate circuit (orange edge) is compared against knocking out a random edge of a randomly inflated circuit (the randomly added path is blue; the knocked-out edge in the added path is red). The minimality test asks whether knocking out the random red edge is more significant than knocking out the orange candidate edge.
  • Figure 5: The main figures display the change in task performance score induced by knocking out edge $e$, for every $e$ in each circuit. The changes in score are sorted from low to high along the x-axis. The right-adjacent vertical histograms show the change in task performance scores of the reference edges (ranging from $1000$ to $10,000$ edges). The shaded region covers the individual edges with corrected $p$-values that are below the significance threshold.
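
One way to read the comparison in Figure 2 (left) is as an empirical quantile: the fraction of random reference circuits that the candidate circuit outperforms. The sketch below is illustrative only. The function name `relative_faithfulness`, the convention that a higher score means a more faithful circuit, and the example scores are all assumptions, not taken from the paper or from the circuitry package.

```python
def relative_faithfulness(candidate_score, reference_scores):
    """Fraction of reference (random) circuits that the candidate strictly beats.

    Assumption: a higher score means a more faithful circuit.
    """
    if not reference_scores:
        raise ValueError("need at least one reference score")
    beaten = sum(1 for s in reference_scores if candidate_score > s)
    return beaten / len(reference_scores)


# Made-up scores, for illustration only.
reference = [0.10, 0.35, 0.20, 0.50, 0.05]
print(relative_faithfulness(0.40, reference))
```

A value near 1 would mean the candidate is more faithful than almost every random circuit drawn from the reference distribution; a value near 0.5 would mean it is indistinguishable from a typical random circuit.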

Theorems & Definitions (2)

  • Example 1: Greater-Than
  • Definition 1: Hilbert-Schmidt Independence Criterion (HSIC)
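
The body of Definition 1 is not reproduced on this page. As a rough orientation, the following is a minimal sketch of the standard biased empirical HSIC estimator, tr(K H L H) / n^2, where K and L are kernel Gram matrices and H is the centering matrix. The scalar inputs, the Gaussian kernel, the fixed bandwidth, and the 1/n^2 normalization are assumptions made for illustration; the paper may use a different kernel or normalization.

```python
import math


def gaussian_gram(xs, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel on scalar samples (assumed kernel)."""
    return [[math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2)) for b in xs]
            for a in xs]


def center(K):
    """Return H K H, where H = I - (1/n) * ones is the centering matrix."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)]
            for i in range(n)]


def hsic_biased(xs, ys, bandwidth=1.0):
    """Biased empirical HSIC: tr(K H L H) / n^2."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length, at least 2")
    Kc = center(gaussian_gram(xs, bandwidth))
    L = gaussian_gram(ys, bandwidth)
    # tr(K H L H) = tr((H K H) L) by cyclicity of the trace.
    trace = sum(Kc[i][j] * L[j][i] for i in range(n) for j in range(n))
    return trace / n ** 2


xs = [0.0, 1.0, 2.0, 3.0]
print(hsic_biased(xs, xs))                    # dependent: strictly positive
print(hsic_biased(xs, [5.0, 5.0, 5.0, 5.0]))  # constant ys: zero up to rounding
```

The estimate is zero (up to floating-point error) when one variable is constant and grows as the two samples become more dependent, which is what makes HSIC usable as an independence test statistic.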