Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

Lucius Bushnaq; Jake Mendel; Stefan Heimersheim; Dan Braun; Nicholas Goldowsky-Dill; Kaarel Hänni; Cindy Wu; Marius Hobbhahn

TL;DR

This work investigates degeneracy in neural network parameterizations as a primary obstacle to mechanistic interpretability, framing the problem through Singular Learning Theory (SLT) and using its local learning coefficient $\hat{\lambda}$ to quantify the effective parameter count. It introduces the behavioral loss $L_B$ and a finite-data version of SLT to adapt degeneracy measures to real, finite datasets, defining an effective parameter count $N_{\text{eff}}(\epsilon) = 2\lambda_B(\epsilon)$. The authors identify three internal sources of degeneracy: activations spanning a low-dimensional subspace, Jacobians spanning a low-dimensional subspace, and synchronized nonlinearities (including attention patterns). They show how these manifest as sparsity in interactions and construct the Interaction Basis, a parameterization-invariant representation that diagonalizes layer interactions under certain conditions and promotes sparser cross-layer couplings. They further argue that modularity reduces the local learning coefficient by isolating degeneracies within modules, and propose a modularity metric based on interaction strengths, with companion experiments suggesting sparse, modular representations on toy models and limited transfer to large language models. Overall, the paper lays out a framework for obtaining more interpretable, invariant representations that reflect the computational structure of networks, and it informs practical directions for reverse engineering and circuit discovery.

Abstract

Mechanistic Interpretability aims to reverse engineer the algorithms implemented by neural networks by studying their weights and activations. An obstacle to reverse engineering neural networks is that many of the parameters inside a network are not involved in the computation being implemented by the network. These degenerate parameters may obfuscate internal structure. Singular learning theory teaches us that neural network parameterizations are biased towards being more degenerate, and parameterizations with more degeneracy are likely to generalize further. We identify 3 ways that network parameters can be degenerate: linear dependence between activations in a layer; linear dependence between gradients passed back to a layer; ReLUs which fire on the same subset of datapoints. We also present a heuristic argument that modular networks are likely to be more degenerate, and we develop a metric for identifying modules in a network that is based on this argument. We propose that if we can represent a neural network in a way that is invariant to reparameterizations that exploit the degeneracies, then this representation is likely to be more interpretable, and we provide some evidence that such a representation is likely to have sparser interactions. We introduce the Interaction Basis, a tractable technique to obtain a representation that is invariant to degeneracies from linear dependence of activations or Jacobians.
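The first degeneracy listed in the abstract, linear dependence between activations in a layer, can be illustrated with a minimal NumPy sketch. The toy dimensions, the random data, and the names `acts`, `W`, and `null_dirs` are all assumptions made for illustration and do not come from the paper. The sketch builds activations that lie in a 3-dimensional subspace of a 6-dimensional layer and checks that the weights reading from that layer can move along the unused directions without changing any output on the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 6-dimensional activations that lie in a
# 3-dimensional subspace, i.e. they are linearly dependent over the dataset.
n_data, d_hidden, d_sub, d_out = 200, 6, 3, 4
basis = rng.normal(size=(d_sub, d_hidden))
acts = rng.normal(size=(n_data, d_sub)) @ basis  # shape (n_data, d_hidden), rank 3

# Weights of the next (linear) map reading from this layer.
W = rng.normal(size=(d_hidden, d_out))

# Right singular vectors with (numerically) zero singular value span the
# directions in activation space that the data never visits.
_, s, vt = np.linalg.svd(acts, full_matrices=False)
rank = int(np.sum(s > 1e-10 * s[0]))
null_dirs = vt[rank:]  # shape (d_hidden - rank, d_hidden)

# Any change to W built from those directions is invisible on this dataset.
delta = null_dirs.T @ rng.normal(size=(d_hidden - rank, d_out))
out_before = acts @ W
out_after = acts @ (W + delta)

n_free = (d_hidden - rank) * d_out  # free weight directions in this toy case
max_change = float(np.max(np.abs(out_after - out_before)))
print(f"rank={rank}, free directions={n_free}, max output change={max_change:.2e}")
```

In this toy case each of the 4 output columns of `W` can move freely in the 3 unused directions, so 12 of the 24 weights' worth of directions do not affect the computation on the data. This is only a linear illustration; the paper's treatment of Jacobians and nonlinearities is not captured here.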

Paper Structure (38 sections, 57 equations, 1 figure)

Introduction
Singular learning theory and the effective parameter count
Background: the local learning coefficient
Modifying SLT for interpretability
Behavioral loss
Singular learning theory at finite data
Internal structures that contribute to degeneracy
Activations and Jacobians
Activation vectors spanning a low dimensional subspace
Relationship to weight norm
Jacobians spanning a low dimensional subspace
Jacobians between hidden layers
Synchronized nonlinearities
All synchronized
Synchronized blocks
...and 23 more sections

Figures (1)

Figure 1: Example of a loss landscape with interacting free directions, from Carroll (2023), lightly edited. The loss does not change when changing $w_1$ alone or $w_2$ alone, so there are two free directions in the landscape. However, the loss does change when changing both $w_1$ and $w_2$ together, so the set of zero loss is cross-shaped rather than spanning the whole plane. Thus, despite there apparently being two free directions, the effective parameter count that characterises the dimensionality of the low loss volume is $1$ rather than $0$. Non-interacting sets of parameters have no joint terms like this in the loss function, so their free directions always span full subspaces with each other.
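The caption's claim can be checked on a concrete toy loss. The sketch below assumes $L(w_1, w_2) = w_1^2 w_2^2$, which is a hypothetical choice with the cross-shaped zero set described in the caption and is not taken from the paper. Inside the box $[-1, 1]^2$ the area with $L < \epsilon$ has the closed form $4\sqrt{\epsilon}\,(1 + \tfrac{1}{2}\ln(1/\epsilon))$, so it shrinks like $\epsilon^{1/2}$ up to a logarithmic factor. Reading the exponent as $\lambda = 1/2$ gives an effective parameter count of $2\lambda = 1$, matching the caption. The function names are illustrative.

```python
import numpy as np

def loss(w1, w2):
    # Assumed toy loss (not from the paper): zero whenever w1 == 0 or w2 == 0.
    return (w1 ** 2) * (w2 ** 2)

def low_loss_area(eps, n=2001):
    # Grid estimate of the area of {loss < eps} inside the box [-1, 1]^2.
    w = np.linspace(-1.0, 1.0, n)
    w1, w2 = np.meshgrid(w, w)
    return 4.0 * float(np.mean(loss(w1, w2) < eps))

def closed_form_area(eps):
    # Area of {|w1 * w2| < sqrt(eps)} inside [-1, 1]^2, valid for 0 < eps < 1.
    d = np.sqrt(eps)
    return float(4.0 * d * (1.0 - np.log(d)))

# Moving along either axis alone keeps the loss at zero; moving both does not.
print(loss(0.5, 0.0), loss(0.0, 0.5), loss(0.5, 0.5))

for eps in (1e-2, 1e-3):
    print(eps, low_loss_area(eps), closed_form_area(eps))
```

If both directions were fully free, the low-loss region would fill the whole box regardless of $\epsilon$. If neither were free, as in $w_1^2 + w_2^2$, the area would shrink like $\epsilon$. The cross-shaped case sits between the two.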
