The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

Lucius Bushnaq, Stefan Heimersheim, Nicholas Goldowsky-Dill, Dan Braun, Jake Mendel, Kaarel Hänni, Avery Griffin, Jörn Stöhler, Magdalena Wache, Marius Hobbhahn

TL;DR

This work tackles mechanistic interpretability by introducing the Local Interaction Basis (LIB), a two-stage, Jacobian-aligned transformation intended to produce a sparsely interacting, computationally-relevant feature basis for neural networks. By combining PCA whitening with an SVD-based rotation of layer-to-layer Jacobians and using integrated gradients to build interaction graphs, LIB seeks to reveal modular circuits and key feature interactions. On a modular addition transformer and a CIFAR-10 MLP, LIB identifies more computationally-relevant features and tends to yield sparser interactions than PCA, though interpretability gains are modest. On language models (GPT2-small and TinyStories-1M), LIB produces limited interpretability improvements and unreliable modular structure, suggesting that the assumption of a linear, non-overcomplete basis may not hold for large LMs and motivating future work on overcomplete representations or alternative bases.

Abstract

Mechanistic interpretability aims to understand the behavior of neural networks by reverse-engineering their internal computations. However, current methods struggle to find clear interpretations of neural network activations because a decomposition of activations into computational features is missing. Individual neurons or model components do not cleanly correspond to distinct features or functions. We present a novel interpretability method that aims to overcome this limitation by transforming the activations of the network into a new basis, the Local Interaction Basis (LIB). LIB aims to identify computational features by removing irrelevant activations and interactions. Our method drops irrelevant activation directions and aligns the basis with the singular vectors of the Jacobian matrix between adjacent layers. It also scales features based on their importance for downstream computation, producing an interaction graph that shows all computationally-relevant features and interactions in a model. We evaluate the effectiveness of LIB on modular addition and CIFAR-10 models, finding that it identifies more computationally-relevant features that interact more sparsely, compared to principal component analysis. However, LIB does not yield substantial improvements in interpretability or interaction sparsity when applied to language models. We conclude that LIB is a promising theory-driven approach for analyzing neural networks, but in its current form is not applicable to large language models.
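
The two stages described in the abstract (dropping and whitening activation directions, then aligning with the Jacobian's singular vectors) can be illustrated on a toy layer. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names `pca_whiten` and `jacobian_rotation`, the `tanh` toy layer, the tolerances, and the use of a single stacked SVD over all data points are all hypothetical choices made for illustration. The importance-based scaling mentioned in the abstract is omitted.

```python
import numpy as np

def pca_whiten(acts, tol=1e-8):
    """Stage 1 (sketch): drop near-zero-variance directions and whiten.

    Returns the mean, a whitening map W (d, k) and its inverse W_inv (k, d).
    """
    mean = acts.mean(axis=0)
    centered = acts - mean
    cov = centered.T @ centered / len(acts)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    keep = eigvals > tol * eigvals.max()          # discard directions with ~zero variance
    W = eigvecs[:, keep] / np.sqrt(eigvals[keep])
    W_inv = (eigvecs[:, keep] * np.sqrt(eigvals[keep])).T
    return mean, W, W_inv

def jacobian_rotation(jacobians, tol=1e-8):
    """Stage 2 (sketch): right singular vectors of the stacked Jacobians.

    jacobians has shape (n_data, d_out, k). Directions whose singular value
    is ~zero do not influence the next layer and are dropped.
    """
    stacked = jacobians.reshape(-1, jacobians.shape[-1])   # (n_data * d_out, k)
    _, s, vt = np.linalg.svd(stacked, full_matrices=False)
    keep = s > tol * s.max()
    return vt[keep].T, s[keep]

# Toy data (assumption): 6-dimensional activations that only span 4 directions,
# feeding a tanh layer with 3 outputs.
rng = np.random.default_rng(0)
acts = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 6))
W_next = rng.normal(size=(6, 3))

mean, W, W_inv = pca_whiten(acts)
h = (acts - mean) @ W                             # whitened activations, shape (500, 4)

# Analytic Jacobian of y = tanh((mean + h @ W_inv) @ W_next) with respect to h.
y = np.tanh(acts @ W_next)                        # (500, 3)
M = (W_inv @ W_next).T                            # (3, 4)
jacobians = (1.0 - y**2)[:, :, None] * M[None, :, :]   # (500, 3, 4)

V, sing_vals = jacobian_rotation(jacobians)
lib_like = h @ V                                  # rotated, reduced coordinates
print(h.shape, V.shape, lib_like.shape)
```

In this toy setup the first stage reduces six activation directions to the four that carry variance, and the second stage keeps only the three directions the next layer actually reads from.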

Paper Structure (47 sections, 25 equations, 22 figures, 2 algorithms)

Figures (22)

  • Figure 1: The Local Interaction Basis (LIB) is a basis for neural network activations where interactions between features should be sparser and more modular. (1) We start with a selection of layers from the neural network. (2) We transform the activations in these layers into the LIB, which represents computationally-relevant features, removes features that don't affect the output, and minimizes interactions between features in adjacent layers. (3) We then quantify the interactions between features using integrated gradients, creating an interaction graph that represents the extent to which preceding nodes affect subsequent nodes. (4) We use the resulting interaction graph to analyze and interpret features in the neural network, and to identify modules that correspond to distinct circuits in the model's computation.
  • Figure 2: Visualization of the LIB transformation. This figure illustrates how activations (top) and gradients (bottom) are transformed into the LIB. The first step is a PCA of the activations in every layer, which drops activation directions with near-zero variance and whitens the activations. The second step is based on a dataset of gradients, that is, the set of gradients of every feature in the next layer with respect to every direction in the current layer on every data point (this is a larger dataset than the activations). We perform an SVD (singular value decomposition) on the Jacobians to find the right singular vectors and singular values. This allows us to drop directions that are not important for the next layer, and to align the activation basis with the singular vectors, sparsifying the interactions between features in adjacent layers.
  • Figure 3: LIB interaction graph of a modular addition transformer. The three layers correspond to activations after the embedding, directly after the attention, and just before the unembedding. The individual nodes represent LIB features, and the thickness of the edges shows the interaction strength between features. The nodes are colored by module membership (Leiden algorithm), and labeled by their function index ($\hat{f}_0, \hat{f}_1,\dots$, in order of decreasing functional importance) and their Fourier interpretation.
  • Figure 4: Monosemanticity of features in the LIB and PCA bases.
  • Figure 5: Number of nodes required to preserve >99.9% accuracy for LIB and PCA on five modular addition transformers trained with different random seeds.
  • ...and 17 more figures
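
The caption of Figure 1 mentions quantifying interactions between features in adjacent layers with integrated gradients. The following NumPy sketch shows the general technique on a toy `tanh` layer; it is an illustration under assumptions rather than the paper's procedure. In particular, the zero baseline, the midpoint-rule discretization, the helper names, and the root-mean-square aggregation into edge weights are hypothetical choices that this excerpt does not specify.

```python
import numpy as np

def layer(x, W):
    """Toy layer standing in for the map between two adjacent layers."""
    return np.tanh(x @ W)

def layer_jacobian(x, W):
    """Jacobian of the toy layer at one input x, shape (d_out, d_in)."""
    y = np.tanh(x @ W)
    return (1.0 - y**2)[:, None] * W.T

def integrated_gradients(x, W, steps=128):
    """Attribution[j, i] of output j to input i.

    Integrates the Jacobian along the straight path from a zero baseline
    to x (midpoint rule), then multiplies by the input displacement.
    """
    alphas = (np.arange(steps) + 0.5) / steps
    avg_jac = np.mean([layer_jacobian(a * x, W) for a in alphas], axis=0)
    return avg_jac * x[None, :]

rng = np.random.default_rng(0)
W = 0.3 * rng.normal(size=(5, 3))     # 5 input features, 3 output features
X = rng.normal(size=(200, 5))         # toy dataset

attributions = np.stack([integrated_gradients(x, W) for x in X])   # (200, 3, 5)

# Assumed aggregation: root-mean-square attribution over the dataset gives one
# non-negative edge weight per (output feature, input feature) pair.
edges = np.sqrt((attributions**2).mean(axis=0))                    # (3, 5)

# Completeness check: attributions for each output sum to f(x) - f(baseline).
gap = np.abs(attributions.sum(axis=2) - (layer(X, W) - layer(np.zeros(5), W))).max()
print(edges.shape, gap)
```

The completeness check is a useful sanity test for any integrated-gradients implementation: if the summed attributions do not match the change in output, the path integral is too coarsely discretized or the Jacobian is wrong.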