Addressing divergent representations from causal interventions on neural networks

Satchel Grant, Simon Jerome Han, Alexa R. Tartaglini, Christopher Potts

TL;DR

This work investigates whether causal interventions used for mechanistic interpretability produce representations that diverge from a model's natural latent distribution, potentially undermining the faithfulness of explanations. It provides theoretical and empirical evidence that divergence is common across activation patching, SAE projections, and Distributed Alignment Search (DAS), and it distinguishes harmless divergences (within null-space or decision boundaries) from pernicious ones (off-manifold activations and dormant behavioral changes). The authors propose mitigating divergence via the Counterfactual Latent (CL) loss, including a causal-subspace–targeted variant, and demonstrate reduced representational divergence while preserving or improving interpretability and out-of-distribution (OOD) performance in synthetic tasks and Boundless DAS experiments. These results offer a practical step toward more reliable causal interventions for mechanistic interpretability, highlighting the need for reporting divergence and developing robust, manifold-constrained intervention methods.

Abstract

A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we ask whether such interventions create out-of-distribution (divergent) representations, and whether this raises concerns about how faithful their resulting explanations are to the target model in its natural state. First, we demonstrate theoretically and empirically that common causal intervention techniques often do shift internal representations away from the natural distribution of the target model. Then, we provide a theoretical analysis of two cases of such divergences: "harmless" divergences that occur in the behavioral null-space of the layer(s) of interest, and "pernicious" divergences that activate hidden network pathways and cause dormant behavioral changes. Finally, in an effort to mitigate the pernicious cases, we apply and modify the Counterfactual Latent (CL) loss from Grant (2025), which allows representations from causal interventions to remain closer to the natural distribution, reducing the likelihood of harmful divergences while preserving the interpretive power of the interventions. Together, these results highlight a path towards more reliable interpretability methods.
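As a rough illustration of the two ideas above (patching can leave the natural distribution, and the CL loss pulls intervened representations back toward it), here is a minimal sketch in plain Python. It is not the authors' implementation. The toy representation `(x1, x2, x1 * x2)`, the helper names, and the use of squared L2 distance as the loss are all assumptions made for illustration; only the definition of the CL representation, as the average of all natural representations that share the post-intervention variable values, comes from the Figure 3 caption.

```python
from itertools import product

# Hypothetical toy setup: a 3-d representation whose first two coordinates
# encode causal variables x1 and x2, and whose third coordinate naturally
# covaries with both (x1 * x2 plus a small fixed jitter).
OFFSETS = (-0.1, 0.0, 0.1)


def natural_reps(x1, x2):
    """All natural representations for one (x1, x2) setting."""
    return [(float(x1), float(x2), x1 * x2 + d) for d in OFFSETS]


NATURAL = {(x1, x2): natural_reps(x1, x2)
           for x1, x2 in product(range(3), repeat=2)}


def patch_x2(target, source):
    """Copy only the x2 coordinate from source into target."""
    return (target[0], source[1], target[2])


def cl_target(x1, x2):
    """Average of the natural representations with these variable values."""
    reps = NATURAL[(x1, x2)]
    return tuple(sum(r[i] for r in reps) / len(reps) for i in range(3))


def sq_dist(u, v):
    """Squared L2 distance (an assumed stand-in for the CL loss form)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_sq_dist(h):
    """Squared distance from h to its nearest natural representation."""
    return min(sq_dist(h, r) for reps in NATURAL.values() for r in reps)


target = NATURAL[(2, 2)][1]   # natural representation with x1=2, x2=2
source = NATURAL[(1, 0)][1]   # natural representation with x1=1, x2=0
intervened = patch_x2(target, source)  # encodes x1=2, x2=0

# The third coordinate still reflects the old x2, so the patched vector
# matches no natural representation even though x1=2, x2=0 occurs naturally.
divergence = nearest_sq_dist(intervened)

# CL-style penalty: distance to the average natural vector for (x1=2, x2=0).
cl_loss = sq_dist(intervened, cl_target(2, 0))
print(divergence, cl_loss)
```

In this toy case the divergence lives entirely in the coordinate that was not patched, which is why a loss defined only on behavior would not necessarily notice it.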

Paper Structure

This paper contains 50 sections, 42 equations, 9 figures, 1 algorithm.

Figures (9)

  • Figure 1: Causal interventions can recruit hidden circuits that produce misleadingly confirmatory or dormant behavior. (a) Consider natural pathways (dashed arrows) for two classes A and B that carry activity to different behavioral outputs $y$. In a hypothetical intervention meant to find path A, patching $h^1$ with a divergent representation can activate distinct, hidden pathways (solid arrows) that result in misleadingly confirmatory behavior (orange) and/or undetected behavior (red). (b) Consider 2D projections of the neural activity of $h^1$ for a different network that classifies states into one of 10 classes (denoted by hue). Suppose that natural representations (dark points) lie within well-defined decision boundaries (dashed lines) and covary along causal axes, and that intervened representations (light points) are constructed by patching the first axis from a sampled natural representation. Although these representations diverge from the natural distribution, this can be harmless (top) or pernicious (bottom) depending on the network's functional landscape. In particular, it can be pernicious if the network has a functional landscape where intervened activity unknowingly recruits hidden circuits (visualized as an orange region) or crosses dormant behavioral boundaries (red regions).
  • Figure 2: Representational divergence is a common occurrence across various interventions. (a) Directly replacing a coordinate value in one natural representation (orange) with the value from another will eventually create divergent representations (blue). (b) Top two principal components of natural and corresponding intervened representations, taken from the residual stream at the intervention position, with PCA performed over the combined set of natural and intervened vectors, for three popular causal intervention techniques: a replication of feng2024languagemodelsbindentities for mean difference patching, reconstructed vectors for a single transformer layer using SAELens bloom2024saelens for the sparse autoencoder, and interchange interventions for Boundless DAS wu2024alpacadas. (c) L2 distance between natural and corresponding intervened representations, and Earth Mover's Distance (EMD) between natural and intervened distributions (with a baseline comparing the natural distribution to itself).
  • Figure 3: The CL loss reduces representational divergence and can improve out-of-distribution generalization. (a) PCA of natural (orange) and intervened (blue) representations in the Boundless DAS setting presented in wu2024alpacadas for two CL loss weightings with the same final IIA. (b) IIA (orange) and divergence (purple) of intervened representations from Section \ref{sec:clboundlessdas} as a function of CL loss weight ($\epsilon$). (c) Diagram of CL loss; rectangles are model representations and $x_1$ and $x_2$ are deterministic values of the representations along the two synthetic causal dimensions shown in panels (d) and (e). We patch the $x_2$ value from source to target using DAS and define the CL representation as the average of all natural representations that possess the same variable values as the post-intervention representation. (d) and (e) Two causal feature dimensions of representations from a synthetic dataset consisting of ten classes (colors), with both natural (dark) and intervened (light) representations shown. (d) shows results from DAS trained using behavior only; (e) shows DAS trained using only the CL loss. (f) Performance of alignment matrices trained on one task and evaluated on another that uses the same causal dimensions. CL loss leads to higher OOD performance.
  • Figure 4: Additional divergence measures demonstrating the difference between the natural and intervened distributions. Each panel is labeled by its y-axis. Each metric is computed over a random sample of natural vectors, which simulates the natural manifold, and a sampled set of intervened or natural vectors whose distance from the natural distribution is measured. We refer to this second set as the "compared" distribution. The sampled intervened and natural vectors are always the "ground-truth pair" described at the beginning of Appendix \ref{sup:empiricalIRD}. Nearest Cosine Distance refers to the cosine distance to the nearest sample in the natural manifold. Multiple samples in the compared distribution can share the same natural sample. This value is averaged over all compared samples. Nearest L2 Distance refers to the L2 distance to the nearest sample in the natural manifold. Multiple samples in the compared distribution can share the same natural sample. This value is averaged over all compared samples. Min Cos Pairing refers to the lowest cost pairing where cost is the cosine distance between two samples. Vector pairs are exclusive. This value is normalized by the number of samples. Min L2 Pairing refers to the lowest cost pairing where cost is the L2 distance between two samples. Vector pairs are exclusive. This value is normalized by the number of samples. Local PCA Distance refers to the distance to the manifold created using a local PCA of the nearest neighbors. See Appendix \ref{sup:divergences}. EMD refers to the Earth Mover's Distance. See Appendix \ref{sup:divergences}. KDE refers to the Kernel Density Estimation score. See Appendix \ref{sup:divergences}. Local Linear Reconstruction refers to the local linear reconstruction error. See Appendix \ref{sup:divergences}.
  • Figure 5: Visualization of the different synthetic tasks used for Figure \ref{fig:clevaluation}. The Default Task is split into two partitions, both withholding two classes that are contained in the other partition. The OOD task is also split into two partitions, both consisting of 4 classes. The Dense partition consists of a tighter cluster than the Sparse.
  • ...and 4 more figures
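
The nearest-sample and pairing metrics described in the Figure 4 caption can be sketched as follows. This is an illustrative sketch, not the paper's code: the function names and the tiny 2-d sample sets are hypothetical, and the exclusive pairing is solved here by brute force over permutations (feasible only for very small sets), whereas a practical implementation would use an assignment solver.

```python
import math
from itertools import permutations


def l2(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_distance(compared, natural, dist):
    """Mean distance from each compared sample to its nearest natural sample.

    Several compared samples may share the same natural sample.
    """
    return sum(min(dist(c, n) for n in natural) for c in compared) / len(compared)


def min_pairing(compared, natural, dist):
    """Lowest-cost exclusive one-to-one pairing, normalized by sample count.

    Brute force over all permutations; assumes equal-sized, small sets.
    """
    if len(compared) != len(natural):
        raise ValueError("exclusive pairing needs equal-sized sets")
    best = min(
        sum(dist(c, natural[j]) for c, j in zip(compared, perm))
        for perm in permutations(range(len(natural)))
    )
    return best / len(compared)


# Hypothetical samples: a "natural" set and a "compared" (e.g. intervened) set.
natural = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
compared = [(1.0, 0.1), (1.0, 0.2), (2.0, 2.0)]

near_l2 = nearest_distance(compared, natural, l2)
near_cos = nearest_distance(compared, natural, cosine_distance)
pair_l2 = min_pairing(compared, natural, l2)
print(near_l2, near_cos, pair_l2)
```

Because the pairing metric forbids two compared samples from claiming the same natural sample, it can never be smaller than the corresponding nearest-sample metric; in this example the first two compared vectors both sit near `(1.0, 0.0)`, so the exclusive pairing is strictly more costly.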