Structural Inference: Interpreting Small Language Models with Susceptibilities
Garrett Baker, George Wang, Jesse Hoogland, Daniel Murfet
TL;DR
This work formulates susceptibilities as a linear-response tool for interpretability. Treating a neural network as a Bayesian statistical-mechanical system, it shows that an infinitesimal perturbation of the data distribution induces a first-order shift in the posterior expectation of an observable localized on a network component. The authors develop an estimation pipeline based on local SGLD sampling and assemble the resulting susceptibilities into a data-response matrix whose low-rank structure reveals modular internal circuits, such as multigram and induction heads, in a 3M-parameter transformer. They relate susceptibilities to component-wise losses and to changes in the data distribution, and they introduce structural inference: factorizing the susceptibility matrix into mode-coupling and head-coupling terms, which enables automatic discovery of internal structure and exposes the balance between expression and suppression. Empirically, the approach identifies a uniform mode, a distinction between word-part and induction patterns, and a battle between induction heads, in line with prior mechanistic findings. Overall, susceptibilities link data perturbations to internal network organization and, via local learning coefficients, to generalization, offering a scalable, theoretically grounded framework for understanding and dissecting large neural networks.
Abstract
We develop a linear response framework for interpretability that treats a neural network as a Bayesian statistical mechanical system. A small perturbation of the data distribution, for example shifting the Pile toward GitHub or legal text, induces a first-order change in the posterior expectation of an observable localized on a chosen component of the network. The resulting susceptibility can be estimated efficiently with local SGLD samples and factorizes into signed, per-token contributions that serve as attribution scores. We combine these susceptibilities into a response matrix whose low-rank structure separates functional modules such as multigram and induction heads in a 3M-parameter transformer.
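To make the linear-response idea concrete, here is a minimal sketch rather than the authors' pipeline. It relies on the standard fluctuation-dissipation identity for a tempered posterior p_h(w) ∝ exp(-nβ (L(w) + h ΔL(w))): the derivative of the posterior expectation of an observable φ with respect to h, at h = 0, equals -nβ Cov[φ, ΔL]. Everything specific below is an assumption chosen for illustration: the one-dimensional quadratic loss, the linear perturbation ΔL(w) = w, the observable φ(w) = w, the helper name `susceptibility_estimate`, and the use of exact Gaussian draws where the paper uses local SGLD samples.

```python
import numpy as np


def susceptibility_estimate(phi_samples, delta_loss_samples, n_beta):
    """Covariance estimator of a susceptibility: -n*beta * Cov[phi, delta_L].

    Both arrays hold one value per posterior sample, evaluated at h = 0.
    (Hypothetical helper, not taken from the paper's code.)
    """
    phi = np.asarray(phi_samples, dtype=float)
    delta_loss = np.asarray(delta_loss_samples, dtype=float)
    cov = np.mean((phi - phi.mean()) * (delta_loss - delta_loss.mean()))
    return -n_beta * cov


# Toy setup (assumed): L(w) = w**2 / 2, so the unperturbed tempered posterior
# is Gaussian with mean 0 and variance 1 / (n * beta).
n_beta = 50.0
rng = np.random.default_rng(0)
w = rng.normal(loc=0.0, scale=np.sqrt(1.0 / n_beta), size=200_000)

# Perturbation delta_L(w) = w and observable phi(w) = w.
# Under L + h * delta_L the posterior mean is -h, so the exact
# susceptibility d<phi>/dh at h = 0 is -1.
chi_hat = susceptibility_estimate(phi_samples=w, delta_loss_samples=w, n_beta=n_beta)
chi_exact = -1.0
print(f"estimated: {chi_hat:.3f}, exact: {chi_exact:.3f}")
```

Because the covariance is linear in ΔL, writing ΔL as a sum of per-token loss differences splits the same estimator into signed per-token terms, which is the sense in which the abstract's attribution scores arise.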
