Identifying Sparsely Active Circuits Through Local Loss Landscape Decomposition
Brianna Chrisman, Lucius Bushnaq, Lee Sharkey
TL;DR
The paper approaches the interpretation of large neural networks by shifting focus from activation space to parameter space, introducing Local Loss Landscape Decomposition (L3D). L3D learns low-rank subnetworks, represented as directions in parameter space, that reconstruct the gradient of the divergence between a sample's output and a reference output, enabling circuit-level interpretability and targeted interventions. On progressively more complex toy models, and in preliminary experiments on a real-world transformer and a CNN, the method recovers interpretable subnetworks and localizes their effects to the relevant samples, suggesting a path toward scalable circuit discovery in real models. The result is a framework for understanding, and potentially steering, the behavior of large models by manipulating compact, interpretable parameter directions.
Abstract
Much of mechanistic interpretability has focused on understanding the activation spaces of large neural networks. However, activation-space approaches reveal little about the underlying circuitry used to compute features. To better understand the circuits employed by models, we introduce a new decomposition method called Local Loss Landscape Decomposition (L3D). L3D identifies a set of low-rank subnetworks: directions in parameter space, a subset of which can reconstruct the gradient of the loss between any sample's output and a reference output vector. We design a series of progressively more challenging toy models with well-defined subnetworks and show that L3D can nearly perfectly recover the associated subnetworks. Additionally, we investigate the extent to which perturbing the model in the direction of a given subnetwork affects only the relevant subset of samples. Finally, we apply L3D to a real-world transformer model and a convolutional neural network, demonstrating its potential to identify interpretable and relevant circuits in parameter space.
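
To make the central idea concrete, the following is a minimal NumPy sketch of what it means for a small subset of parameter-space directions to reconstruct a sample's gradient, and for a perturbation along one direction to affect only the relevant samples. It is not the authors' training procedure: the linear toy model, the squared-error divergence, the all-zeros reference input, and the hand-picked rank-1 directions (which L3D would learn rather than fix) are all assumptions made for illustration, as are the names `divergence_grad` and `reconstruct`.

```python
import numpy as np

d = 4
W = np.eye(d)  # toy linear model y = W @ x (an assumption, not a model from the paper)

def divergence_grad(W, x, y_ref):
    """Gradient w.r.t. W of 0.5 * ||W @ x - y_ref||^2 (assumed divergence)."""
    return np.outer(W @ x - y_ref, x)

# Hypothetical hand-picked dictionary: subnetwork i is the rank-1 direction e_i e_i^T.
# L3D would learn such directions; here they are fixed so the result is easy to check.
basis = np.eye(d)
directions = np.stack([np.outer(basis[i], basis[i]) for i in range(d)])

def reconstruct(grad, directions, k):
    """Rebuild grad from the k directions with the largest projection.

    Plain projection is valid here only because these directions are
    orthonormal under the Frobenius inner product.
    """
    coeffs = np.einsum("kij,ij->k", directions, grad)
    top = np.argsort(-np.abs(coeffs))[:k]
    recon = np.einsum("k,kij->ij", coeffs[top], directions[top])
    return recon, top

# A sparse sample in which only feature 0 is active, compared with the
# output of an all-zeros reference input.
x = 2.0 * basis[0]
y_ref = W @ np.zeros(d)
grad = divergence_grad(W, x, y_ref)
recon, top = reconstruct(grad, directions, k=1)

# Perturb the parameters along subnetwork 0 and compare the output shift on a
# sample that uses feature 0 with the shift on a sample that uses feature 1.
eps = 0.5
W_perturbed = W + eps * directions[0]
shift_relevant = np.linalg.norm(W_perturbed @ x - W @ x)
shift_other = np.linalg.norm(W_perturbed @ basis[1] - W @ basis[1])
```

In this hand-built case a single direction reconstructs the gradient exactly, and the perturbation moves only the output of the sample that uses feature 0. In a real model, the directions would have to be learned and the reconstruction would in general be approximate.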
