Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits

Areeb Ahmad, Abhinav Joshi, Ashutosh Modi

TL;DR

The paper tackles the opacity of transformer internals by proposing a directional interpretability framework that treats attention and MLP blocks as superpositions of orthogonal low-rank subfunctions. It operationalizes this view with a unified linear representation via augmented weight matrices and a learnable diagonal mask over singular directions, optimizing for faithful reconstruction and sparsity. Empirically, it demonstrates that a small set of directions per component can reproduce behavior on IOI, GP, and GT tasks, and that specific heads (e.g., Head 9.6) encode multiple distinct subfunctions within separate singular directions, with logit receptors linking internal directions to token outputs. The findings suggest transformer computations are distributed yet modular, enabling finer mechanistic analyses, targeted editing, and new avenues for interpretable and controllable AI systems, while highlighting limitations and directions for future causal validation and scalability.

Abstract

Transformer-based language models exhibit complex and distributed behavior, yet their internal computations remain poorly understood. Existing mechanistic interpretability methods typically treat attention heads and multilayer perceptron (MLP) layers, the building blocks of a transformer architecture, as indivisible units, overlooking the functional substructure that may be learned within them. In this work, we introduce a more fine-grained perspective that decomposes these components into orthogonal singular directions, revealing superposed and independent computations within a single head or MLP. We validate our perspective on widely used standard tasks such as Indirect Object Identification (IOI), Gender Pronoun (GP), and Greater Than (GT), showing that previously identified canonical functional heads, such as the name mover, encode multiple overlapping subfunctions aligned with distinct singular directions. Nodes in the computational graph that were previously identified as circuit elements show strong activation along specific low-rank directions, suggesting that meaningful computations reside in compact subspaces. While some directions remain challenging to interpret fully, our results highlight that transformer computations are more distributed, structured, and compositional than previously assumed. This perspective opens new avenues for fine-grained mechanistic interpretability and a deeper understanding of model internals.
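To make the idea of "orthogonal singular directions" concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes a random matrix `W` as a hypothetical stand-in for an augmented weight matrix, and it uses a hand-set binary mask where the paper learns a diagonal mask by optimizing for faithful reconstruction and sparsity. The function name `masked_svd_reconstruction` is illustrative only.

```python
import numpy as np

def masked_svd_reconstruction(W, mask):
    """Rebuild W from its singular directions, each scaled by one mask entry."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # Equivalent to sum_i mask[i] * S[i] * outer(U[:, i], Vt[i]):
    # every term is a rank-1 subfunction along one singular direction.
    return (U * (S * mask)) @ Vt

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))  # hypothetical stand-in for an augmented weight matrix
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# A mask of all ones keeps every direction and recovers W.
full = masked_svd_reconstruction(W, np.ones_like(S))

# A sparse mask keeps only the two strongest directions (hand-set here,
# whereas the paper describes learning the mask).
top2_mask = np.zeros_like(S)
top2_mask[:2] = 1.0
top2 = masked_svd_reconstruction(W, top2_mask)

# The reconstruction error comes only from the directions that were masked out.
error = np.linalg.norm(W - top2)
```

Because the singular vectors are orthonormal, switching one direction off leaves the contribution of the others unchanged, which is what lets a mask over directions act as a selector of independent subfunctions.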

Paper Structure

This paper contains 35 sections, 23 equations, 11 figures, 11 tables, 2 algorithms.

Figures (11)

  • Figure 1: Learned singular value masks for Query-Key ($\mathbf{W_{aug}^{QK}}$) matrices across all attention heads in the model. High mask activations correspond to circuit components previously identified for the IOI task (Wang et al., 2022). Each head exhibits sparsity along its singular directions, revealing the fine-grained subspaces driving task behavior.
  • Figure 2: Intervention on the logit receptor for the Gender Pronoun task. Controlling the logit receptor with a scalar intervention modifies the predicted logits.
  • Figure 3: Mean activation of gender-related directions conditioned on Masculine versus Feminine prompt context. The x-axis plots the mean activation $\mathbb{E}[\nu^{\top}u \mid \text{prompt context=he}]$ and the y-axis plots mean $\mathbb{E}[\nu^{\top}u \mid \text{prompt context=she}]$. Error bars show one standard deviation. The dashed diagonal line represents $y=x$, where activations for both pronouns would be equal.
  • Figure 4: Causal interventions (scaling + swapping) show that singular directions control gender pronoun prediction. The plot displays logit differences (Correct Pronoun $-$ Opposite Pronoun) and flipping rates after intervention. Singular values $\sigma_i$ are scaled by an integer factor ($\sigma_{\text{scale}}$) in $\sigma_i (\tilde{a_i} - \nu^\top u_i) v_i^\top$, leading to near-complete prediction reversal at higher scales. This provides causal evidence that these directions are key computational units underlying gender pronoun resolution.
  • Figure 5: Learned singular value masks for OV ($\mathbf{W_{aug}^{OV}}$) matrices across all attention heads in the model. The masks show that heads with high activation across multiple singular dimensions correspond to circuit components previously identified by Wang et al. (2022) for the IOI task.
  • ...and 6 more figures
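
The expression $\sigma_i (\tilde{a_i} - \nu^\top u_i) v_i^\top$ in the Figure 4 caption can be read as the change in a component's output when the activation along one singular direction is replaced by a target value. The sketch below illustrates that reading under stated assumptions: `x` plays the role of $\nu$, `W` is a random stand-in for an augmented weight matrix, and `scale` is a hypothetical multiplier on the correction term. How the paper applies its scaling factor $\sigma_{\text{scale}}$ is not specified in the caption, so this is not a reproduction of the authors' procedure.

```python
import numpy as np

def intervene_on_direction(W, x, i, target_activation, scale=1.0):
    """Shift the output x @ W along singular direction i only.

    Adds scale * S[i] * (target_activation - x @ U[:, i]) * Vt[i]
    to the unmodified output.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    activation = x @ U[:, i]  # current activation along direction i
    delta = scale * S[i] * (target_activation - activation) * Vt[i]
    return x @ W + delta

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))  # hypothetical weight matrix (d_in x d_out)
x = rng.normal(size=6)       # hypothetical input, standing in for nu

# Replace the activation along direction 1 with a chosen target value.
y = intervene_on_direction(W, x, i=1, target_activation=2.5)
```

With `scale=1.0` the output behaves as if the input had activation `target_activation` along direction `i`, while the components along all other singular directions are left untouched.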