Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits
Areeb Ahmad, Abhinav Joshi, Ashutosh Modi
TL;DR
The paper tackles the opacity of transformer internals by proposing a directional interpretability framework that treats attention and MLP blocks as superpositions of orthogonal low-rank subfunctions. It operationalizes this view with a unified linear representation via augmented weight matrices and a learnable diagonal mask over singular directions, optimized for faithful reconstruction and sparsity. Empirically, it demonstrates that a small set of directions per component can reproduce model behavior on the Indirect Object Identification (IOI), Gender Pronoun (GP), and Greater Than (GT) tasks. It also shows that specific heads (e.g., Head 9.6) encode multiple distinct subfunctions in separate singular directions, with logit receptors linking internal directions to token outputs. The findings suggest that transformer computations are distributed yet modular, which enables finer mechanistic analyses and targeted editing and opens new avenues for interpretable and controllable AI systems. The paper also notes limitations and identifies causal validation and scalability as directions for future work.
Abstract
Transformer-based language models exhibit complex and distributed behavior, yet their internal computations remain poorly understood. Existing mechanistic interpretability methods typically treat attention heads and multilayer perceptron (MLP) layers, the building blocks of the transformer architecture, as indivisible units, overlooking the possibility of functional substructure learned within them. In this work, we introduce a more fine-grained perspective that decomposes these components into orthogonal singular directions, revealing superposed and independent computations within a single head or MLP. We validate this perspective on standard tasks such as Indirect Object Identification (IOI), Gender Pronoun (GP), and Greater Than (GT), showing that previously identified canonical functional heads, such as the name mover, encode multiple overlapping subfunctions aligned with distinct singular directions. Nodes in the computational graph that were previously identified as circuit elements show strong activation along specific low-rank directions, suggesting that meaningful computations reside in compact subspaces. While some directions remain challenging to interpret fully, our results highlight that transformer computations are more distributed, structured, and compositional than previously assumed. This perspective opens new avenues for fine-grained mechanistic interpretability and a deeper understanding of model internals.
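To make the idea of decomposing a component into singular directions concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes a single affine map `x @ W + b` standing in for a transformer component, folds the bias into an augmented matrix, and splits the output into one additive term per singular direction. The function names (`augment`, `direction_contributions`, `masked_output`), the toy dimensions, and the hand-set mask are all hypothetical; in the paper the diagonal mask is learned, whereas here it is fixed by hand purely for illustration.

```python
import numpy as np

def augment(W, b):
    """Stack the bias as an extra row so that [x, 1] @ W_aug == x @ W + b."""
    return np.vstack([W, b[None, :]])

def direction_contributions(x, W_aug):
    """Return one output term per singular direction of W_aug.

    Row i is (x_aug . u_i) * s_i * v_i^T; the rows sum to the full output.
    """
    U, S, Vt = np.linalg.svd(W_aug, full_matrices=False)
    x_aug = np.append(x, 1.0)          # append the constant 1 for the bias row
    coeffs = (x_aug @ U) * S           # projection onto each direction, scaled
    return coeffs[:, None] * Vt        # shape (r, d_out)

def masked_output(x, W_aug, mask):
    """Output with each singular direction scaled by a diagonal mask entry."""
    return mask @ direction_contributions(x, W_aug)

# Toy example (assumed sizes): 8 inputs, 4 outputs, so r = 4 directions.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
b = rng.normal(size=4)
x = rng.normal(size=8)

W_aug = augment(W, b)
full = x @ W + b
all_on = masked_output(x, W_aug, np.ones(4))                    # every direction kept
top2 = masked_output(x, W_aug, np.array([1.0, 1.0, 0.0, 0.0]))  # two directions kept
```

With an all-ones mask the output matches the original affine map up to floating-point error, while a sparse mask keeps only the contribution of the selected directions. Under these assumptions, this is the sense in which a small set of directions can be tested for whether it reproduces a component's behavior.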
