Weight-sparse transformers have interpretable circuits
Leo Gao, Achyuta Rajaram, Jacob Coxon, Soham V. Govande, Bowen Baker, Dan Mossing
TL;DR
This work introduces weight-sparse transformers whose internal circuits are highly interpretable, making it possible to isolate compact, task-specific subcircuits that remain faithful to model behavior. The authors enforce strong $L_0$ weight sparsity (together with activation sparsity), apply a novel pruning method to extract minimal circuits for simple Python-code tasks, and validate the circuits' faithfulness via mean ablations and targeted perturbations. At comparable pretraining loss, sparse models yield roughly $16$-fold smaller circuits than their dense counterparts, and scaling the total parameter count improves the capability-interpretability frontier. The authors also introduce bridges that relate sparse circuits to dense models, with preliminary results suggesting that the method can be adapted to explain existing dense models without retraining them. The work highlights both the promise and the practical challenges of mechanistic interpretability via sparse circuits, including computational inefficiency and incomplete faithfulness, and suggests directions toward scalable, automated interpretability frameworks.
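To make the idea of validating a circuit by mean ablation concrete, here is a minimal sketch on a toy one-hidden-layer ReLU network. It is an illustration under stated assumptions, not the authors' implementation: the network, the weights, the dataset, and the helper names (`hidden_activations`, `mean_activations`, `forward_mean_ablated`) are all hypothetical. Nodes outside a candidate circuit are frozen to their mean activation over a reference dataset, and the circuit is considered faithful on an input if the output is preserved.

```python
def relu(x):
    return x if x > 0 else 0.0

def hidden_activations(x, W_in):
    """Hidden-layer activations of a toy one-hidden-layer ReLU network."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in W_in]

def mean_activations(dataset, W_in):
    """Mean activation of each hidden node over a reference dataset."""
    acts = [hidden_activations(x, W_in) for x in dataset]
    return [sum(col) / len(acts) for col in zip(*acts)]

def forward_mean_ablated(x, W_in, w_out, keep_mask, means):
    """Forward pass where every node NOT in the circuit is replaced by its mean."""
    h = hidden_activations(x, W_in)
    h = [hi if keep else mu for hi, keep, mu in zip(h, keep_mask, means)]
    return sum(w * hi for w, hi in zip(w_out, h))

# Hypothetical toy weights and data, chosen only for illustration.
W_in = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
w_out = [2.0, 0.0, 1.0]
dataset = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
means = mean_activations(dataset, W_in)

x = [2.0, 0.0]
full = forward_mean_ablated(x, W_in, w_out, [True, True, True], means)
circuit = forward_mean_ablated(x, W_in, w_out, [True, False, True], means)
too_small = forward_mean_ablated(x, W_in, w_out, [True, False, False], means)
```

In this toy setup, ablating the second hidden node leaves the output unchanged because its output weight is zero, so the two-node circuit is faithful on `x`. Ablating the third node as well changes the output, which signals that the smaller candidate circuit is missing a node the computation depends on.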
Abstract
Finding human-understandable circuits in language models is a central goal of the field of mechanistic interpretability. We train models to have more understandable circuits by constraining most of their weights to be zeros, so that each neuron only has a few connections. To recover fine-grained circuits underlying each of several hand-crafted tasks, we prune the models to isolate the part responsible for the task. These circuits often contain neurons and residual channels that correspond to natural concepts, with a small number of straightforwardly interpretable connections between them. We study how these models scale and find that making weights sparser trades off capability for interpretability, and scaling model size improves the capability-interpretability frontier. However, scaling sparse models beyond tens of millions of nonzero parameters while preserving interpretability remains a challenge. In addition to training weight-sparse models de novo, we show preliminary results suggesting our method can also be adapted to explain existing dense models. Our work produces circuits that achieve an unprecedented level of human understandability and validates them with considerable rigor.
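One simple way to constrain most weights to zero is to keep only the largest-magnitude entries of a weight matrix and zero out the rest, for example after each optimizer step. The sketch below illustrates that projection in plain Python. It is an assumption made for illustration: the abstract does not specify the mechanism used, and the function names (`sparsify_top_k`, `count_nonzero`) and the example matrix are hypothetical.

```python
def sparsify_top_k(weights, k):
    """Return a copy of `weights` (a list of rows) that keeps only the
    k largest-magnitude entries and sets every other entry to 0.0."""
    entries = [
        (abs(w), i, j)
        for i, row in enumerate(weights)
        for j, w in enumerate(row)
    ]
    # Sort by magnitude, largest first; break ties by position for determinism.
    entries.sort(key=lambda t: (-t[0], t[1], t[2]))
    keep = {(i, j) for _, i, j in entries[:k]}
    return [
        [w if (i, j) in keep else 0.0 for j, w in enumerate(row)]
        for i, row in enumerate(weights)
    ]

def count_nonzero(weights):
    """Number of nonzero entries, i.e. the L0 'norm' of the matrix."""
    return sum(1 for row in weights for w in row if w != 0.0)

# Hypothetical dense weight matrix, for illustration only.
W = [
    [0.9, -0.05, 0.0, 0.3],
    [0.01, -1.2, 0.2, 0.0],
    [0.4, 0.02, -0.7, 0.1],
]
W_sparse = sparsify_top_k(W, k=4)
```

With `k=4`, only the entries `0.9`, `-1.2`, `0.4`, and `-0.7` survive, so each row of `W_sparse` has at most two nonzero connections. A practical implementation would apply the same projection to tensors in a deep learning framework rather than to nested lists.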
