Structural Inference: Interpreting Small Language Models with Susceptibilities

Garrett Baker, George Wang, Jesse Hoogland, Daniel Murfet

TL;DR

This work formulates susceptibilities as a linear-response interpretability tool by treating neural networks as Bayesian statistical-mechanical systems, where infinitesimal data perturbations induce a first-order shift in posterior expectations of component observables. It develops a local-SGLD-based estimation pipeline to compute susceptibilities and constructs a data-response matrix whose low-rank structure reveals modular internal circuits such as multigram and induction heads in a 3M-parameter transformer. The authors connect susceptibilities to component-wise losses and data distribution changes, and introduce structural inference by factorizing the susceptibility matrix into mode- and head-coupling terms, enabling automatic discovery of internal structure and the balance between expression and suppression. Empirically, the approach identifies a uniform mode, word-part versus induction-pattern distinctions, and a battle between induction heads, aligning with prior mechanistic findings and providing a scalable, theoretically grounded path to mechanistic interpretability. Overall, susceptibilities link data-driven perturbations to internal network organization and generalization via local learning coefficients, offering a principled framework for understanding and dissecting large neural networks.

Abstract

We develop a linear response framework for interpretability that treats a neural network as a Bayesian statistical mechanical system. A small perturbation of the data distribution, for example shifting the Pile toward GitHub or legal text, induces a first-order change in the posterior expectation of an observable localized on a chosen component of the network. The resulting susceptibility can be estimated efficiently with local SGLD samples and factorizes into signed, per-token contributions that serve as attribution scores. We combine these susceptibilities into a response matrix whose low-rank structure separates functional modules such as multigram and induction heads in a 3M-parameter transformer.
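The estimation idea in the abstract can be illustrated on a toy problem. The sketch below is only an illustration under stated assumptions, not the authors' pipeline: it uses a one-dimensional parameter, a quadratic loss with exact (not minibatch) gradients, a standard Langevin update with a quadratic localization term, and hypothetical names throughout (`local_sgld`, `token_coeffs`, `n_beta`, `gamma`). It shows how the covariance between an observable and a perturbation of the loss splits into signed per-token terms by linearity. The sign and normalization used in the paper's definition of the susceptibility are not reproduced here.

```python
import math
import random


def local_sgld(grad_loss, w_star, n_beta, gamma, step, n_steps, burn_in, rng):
    """Draw samples near w_star with a localized Langevin chain (toy, 1-D)."""
    w = w_star
    samples = []
    for t in range(n_steps):
        # Drift: tempered loss gradient plus a pull back toward w_star.
        drift = n_beta * grad_loss(w) + gamma * (w - w_star)
        w = w - 0.5 * step * drift + math.sqrt(step) * rng.gauss(0.0, 1.0)
        if t >= burn_in:
            samples.append(w)
    return samples


def covariance(xs, ys):
    """Plain sample covariance of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
# Toy base loss L(w) = w^2 / 2, so its gradient is w (an assumption).
samples = local_sgld(lambda w: w, w_star=0.0, n_beta=10.0, gamma=1.0,
                     step=0.01, n_steps=20000, burn_in=2000, rng=rng)

# Hypothetical per-token loss perturbations l_i(w) = c_i * w.
token_coeffs = [0.8, -0.5, 0.3, -1.2]
phi = list(samples)  # hypothetical observable phi(w) = w

# Signed per-token covariance contributions Cov[phi, l_i].
per_token = [covariance(phi, [c * w for w in samples]) for c in token_coeffs]

# Perturbation of the total loss: the mean of the per-token terms.
delta_loss = [sum(c * w for c in token_coeffs) / len(token_coeffs)
              for w in samples]
total = covariance(phi, delta_loss)
```

Because covariance is linear in its second argument, the mean of `per_token` equals `total` up to floating-point error, and each contribution carries the sign of its token's coefficient. That is the sense in which an aggregate response can be read as a sum of signed attribution scores.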

Paper Structure

This paper contains 72 sections, 4 theorems, 74 equations, 35 figures, 6 tables, 1 algorithm.

Key Result

Lemma 2.2

The susceptibility for an observable $\phi$ is computed from the posterior covariance of $\phi$ with the first-order change in the loss, $\operatorname{Cov}_\beta[ \phi, \Delta L ]$, where $\operatorname{Cov}_\beta[ \phi, \Delta L ] = \langle \phi \, \Delta L \rangle_\beta - \langle \phi \rangle_\beta \langle \Delta L \rangle_\beta$ and $\Delta L = \frac{\partial L^h}{\partial h} \Bigr|_{h=0}$.
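The covariance form of the lemma is the standard linear-response identity. Assuming a tempered posterior proportional to $\exp(-n\beta L^h(w))$ times a prior (an assumption here, since the paper's Definition 2.1 is not reproduced in this summary), differentiating the expectation gives $\frac{\partial}{\partial h} \langle \phi \rangle_{\beta,h} \bigr|_{h=0} = -n\beta \operatorname{Cov}_\beta[\phi, \Delta L]$. How the paper normalizes this derivative into the susceptibility is not asserted here. A minimal numerical check on a hypothetical one-dimensional loss, using a flat prior on a grid and invented names (`N_BETA`, `loss`, `observable`), might look like this:

```python
import math

N_BETA = 5.0  # hypothetical value of n * beta


def loss(w, h):
    """Toy perturbed loss L^h(w); only the h-dependence matters here."""
    return 0.25 * w ** 4 + 0.5 * w ** 2 + h * math.sin(w)


def delta_loss(w):
    """Derivative of loss(w, h) with respect to h at h = 0."""
    return math.sin(w)


def observable(w):
    """Hypothetical observable phi(w)."""
    return w


# Uniform grid on [-4, 4]; posterior mass outside is negligible.
GRID = [-4.0 + 8.0 * i / 4000 for i in range(4001)]


def expectation(f, h):
    """Posterior expectation of f under exp(-N_BETA * loss(., h)) on the grid."""
    weights = [math.exp(-N_BETA * loss(w, h)) for w in GRID]
    z = sum(weights)
    return sum(f(w) * p for w, p in zip(GRID, weights)) / z


def linear_response():
    """-n*beta * Cov[phi, Delta L], evaluated at h = 0."""
    e_phi = expectation(observable, 0.0)
    e_dl = expectation(delta_loss, 0.0)
    e_prod = expectation(lambda w: observable(w) * delta_loss(w), 0.0)
    return -N_BETA * (e_prod - e_phi * e_dl)


def finite_difference(eps=1e-4):
    """Central difference of <phi> with respect to h at h = 0."""
    return (expectation(observable, eps)
            - expectation(observable, -eps)) / (2.0 * eps)


response = linear_response()
slope = finite_difference()
```

The two numbers agree to within the finite-difference error, so the covariance can be estimated from samples of the unperturbed posterior alone, without refitting the model on perturbed data.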

Figures (35)

  • Figure 1: We introduce a new framework for interpretability based on Bayesian learning theory and statistical mechanics for automatically discovering internal structure in neural networks.
  • Figure 2: Per-token susceptibilities reveal patterns of expression and suppression. In this context, the beginnings of three lines are shown. Each line is repeated six times, and each repeat is divided into three color bars. These bars correspond to individual heads in layer $0$ and layer $1$, as shown on the left. The susceptibility increases for some heads on repeated token sequences such as PriorArt. These positive susceptibilities are examples of layer $1$ multigram heads suppressing induction patterns, an indicator of functional organization.
  • Figure 3: Susceptibilities decompose into interpretable loadings over data. For each principal component of the per-token susceptibility PCA, the percentage of the tokens with the largest-magnitude coefficients that follow each of the six patterns.
  • Figure 4: Susceptibilities decompose into interpretable loadings over components. The loadings of the top three principal components for per-token susceptibility PCA on attention heads.
  • Figure 5: Visualizing per-token susceptibilities for three heads on a sample from arXiv. Each token is highlighted in three segments (top, middle, bottom), which correspond to the per-token susceptibilities for the three heads. Green means positive susceptibility and red means negative; more solid color means higher magnitude, with zero shown as white.
  • ...and 30 more figures
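
The captions for Figures 3 and 4 describe a PCA of per-token susceptibilities with loadings over data (tokens) and over components (heads). As a rough sketch of that kind of decomposition, the code below builds a synthetic token-by-head matrix and extracts its leading principal component by power iteration. The matrix values, the names `token_scores` and `head_pattern`, and the preprocessing (column centering only) are all assumptions for illustration; the paper's actual data and PCA setup are not given in this summary.

```python
import math


def center_columns(matrix):
    """Subtract each column's mean, as PCA does before decomposition."""
    n_rows = len(matrix)
    means = [sum(row[j] for row in matrix) / n_rows
             for j in range(len(matrix[0]))]
    return [[x - m for x, m in zip(row, means)] for row in matrix]


def top_component(matrix, n_iter=200):
    """Leading singular triple (sigma, u, v) via power iteration on X^T X."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(n_iter):
        xv = [sum(x * vj for x, vj in zip(row, v)) for row in matrix]
        w = [sum(matrix[i][j] * xv[i] for i in range(n_rows))
             for j in range(n_cols)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    xv = [sum(x * vj for x, vj in zip(row, v)) for row in matrix]
    sigma = math.sqrt(sum(c * c for c in xv))
    u = [c / sigma for c in xv]
    return sigma, u, v


# Synthetic rank-one data: one score per token, one signed weight per head.
token_scores = [2.0, -1.0, 0.5, -1.5, 0.0]   # hypothetical
head_pattern = [1.0, 0.8, -0.6, -0.9]        # hypothetical, two opposed groups
susceptibilities = [[a * b for b in head_pattern] for a in token_scores]

centered = center_columns(susceptibilities)
sigma, token_loadings, head_loadings = top_component(centered)
```

Here `head_loadings` recovers the signed pattern over heads (two groups with opposite sign) and `token_loadings` recovers the per-token scores, which is the kind of structure the two figures display for real susceptibilities.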

Theorems & Definitions (23)

  • Definition 2.1
  • Lemma 2.2
  • proof
  • Lemma 2.3
  • proof
  • Definition 2.4
  • Definition 2.5
  • Definition 3.1
  • proof: Proof of Lemma \ref{lemma:persamp_suscep}
  • proof: Proof of Lemma \ref{lemma:deltaLisdiff}
  • ...and 13 more