How Reliable are Causal Probing Interventions?

Marc Canby, Adam Davies, Chirag Rastogi, Julia Hockenmaier

TL;DR

This work develops a formal reliability framework for causal probing of LLMs, introducing completeness and selectivity as core desiderata measured via validation probes. It provides a unified way to compare intervention families (linear vs. nonlinear, concept removal vs. counterfactual) and demonstrates that nonlinear gradient-based interventions tend to be more reliable, with stronger downstream effects on tasks like subject-verb agreement. The study shows a persistent tradeoff between completeness and selectivity across models and layers, and highlights the value of calibrating intervention hyperparameters to balance these goals. The results motivate cautious interpretation of causal probing findings and suggest that multi-layer, nonlinearity-aware analyses are essential for robust insights into how latent representations drive model behavior.

Abstract

Causal probing aims to analyze foundation models by examining how intervening on their representation of various latent properties impacts their outputs. Recent works have cast doubt on the theoretical basis of several leading causal probing methods, but it has been unclear how to systematically evaluate the effectiveness of these methods in practice. To address this, we define two key causal probing desiderata: completeness (how thoroughly the representation of the target property has been transformed) and selectivity (how little non-targeted properties have been impacted). We find that there is an inherent tradeoff between the two, and we define reliability as their harmonic mean. We introduce an empirical analysis framework to measure and evaluate these quantities, allowing us to make the first direct comparisons between different families of leading causal probing methods (e.g., linear vs. nonlinear, or concept removal vs. counterfactual interventions). We find that: (1) all methods show a clear tradeoff between completeness and selectivity; (2) more complete and reliable methods have a greater impact on LLM behavior; and (3) nonlinear interventions are almost always more reliable than linear interventions. Our project webpage is available at: https://ahdavies6.github.io/causal_probing_reliability/
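
The abstract defines reliability as the harmonic mean of completeness and selectivity. A minimal sketch of that combination follows, assuming both scores have already been computed and lie in [0, 1]. The function name `reliability` and the zero-score convention are illustrative assumptions, not taken from the authors' code.

```python
def reliability(completeness: float, selectivity: float) -> float:
    """Harmonic mean of completeness and selectivity.

    Assumption: both inputs are scores in [0, 1]. The name and the
    handling of the all-zero case are illustrative, not the paper's code.
    """
    if not (0.0 <= completeness <= 1.0 and 0.0 <= selectivity <= 1.0):
        raise ValueError("completeness and selectivity must lie in [0, 1]")
    total = completeness + selectivity
    if total == 0.0:
        # Assumed convention: if both scores are zero, reliability is zero.
        return 0.0
    return 2.0 * completeness * selectivity / total
```

Because the harmonic mean is pulled toward the smaller of its two arguments, an intervention that is perfectly complete but not at all selective (or the reverse) scores zero under this sketch, which is why the measure rewards balancing the two desiderata rather than maximizing one.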

Paper Structure (44 sections, 9 equations, 12 figures, 3 tables)

Figures (12)

  • Figure 1: Causal Probing and Our Reliability Framework. The process of causal probing is shown in the gray box, with our reliability framework in the purple box. Causal Probing: embeddings $\mathbf{h}^l$ are extracted from layer $L = l$ of a model and used to train a probe $g_Z$ to predict the value $Z = z$ of property $Z$ from embeddings (e.g., the number of the subject, "boy", is $Z = \texttt{Sg}$ for singular). A causal probing intervention $\mathrm{do}(Z = \texttt{Pl})$ uses the probe $g_Z$ to modify the representation encoded by $\mathbf{h}^l$ to encode plural instead. The resulting intervened embedding $\hat{\mathbf{h}}^l$ is fed back into the model at layer $L = l+1$ and the forward pass is completed, changing the original prediction "opens" to the intervened prediction "open". Reliability Framework: instead of feeding the intervened embedding $\hat{\mathbf{h}}^l$ back into the model, it is passed alongside $\mathbf{h}^l$ to validation probes $\{ v_{Z_i} \}$ that independently test whether the intervention has had the intended effect. Completeness is measured as the similarity between the validation probe prediction and the target distribution for the intervention (e.g., a perfectly complete counterfactual intervention $\mathrm{do}(Z = \texttt{Pl})$ would lead validation probe $v_Z$ to predict plural with probability $P_v(Z = \texttt{Pl} \mid \hat{\mathbf{h}}^l) = 1$), and selectivity is the similarity between the validation probe distribution for non-targeted properties before and after the intervention (which, for a perfectly selective intervention, should not change).
  • Figure 2: Completeness, selectivity, reliability, and $\Delta \text{Task Acc}$ for all interventions in the final layer of Pythia-160M. Each point in both plots corresponds to a different hyperparameter setting. (The section labeled sec:interhyp contains analogous results for all other models.)
  • Figure 3: Maximum reliability by layer for each intervention across all layers of Pythia-160M. (The section labeled sec:layerwise_other_models contains analogous results for all other models.)
  • Figure 4: (BERT) Completeness, selectivity, reliability, and $\Delta \text{Task Acc}$ for all interventions in BERT's final layer. Each point in both plots corresponds to a different hyperparameter setting.
  • Figure 5: (GPT2) Completeness, selectivity, reliability, and $\Delta \text{Task Acc}$ for all interventions in the final layer of GPT2. Each point in both plots corresponds to a different hyperparameter setting.
  • ...and 7 more figures
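
The Figure 1 caption describes completeness and selectivity as similarities between validation probe distributions. The sketch below illustrates that idea for a counterfactual intervention. It assumes similarity is measured as one minus total variation distance, which is a stand-in chosen for illustration; this excerpt does not specify the paper's exact similarity measure. All function names are hypothetical.

```python
from typing import Sequence


def tv_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """Return 1 - total variation distance between two discrete distributions.

    Assumption: this similarity is a stand-in for illustration only.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of outcomes")
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def completeness(probe_after: Sequence[float], target: Sequence[float]) -> float:
    """Similarity between the validation probe's post-intervention prediction
    for the targeted property and the intervention's target distribution."""
    return tv_similarity(probe_after, target)


def selectivity(probe_before: Sequence[float], probe_after: Sequence[float]) -> float:
    """Similarity between the validation probe's predictions for a
    non-targeted property before and after the intervention."""
    return tv_similarity(probe_before, probe_after)


# Targeted property Z with values (Sg, Pl); the intervention is do(Z = Pl),
# so the target distribution puts all mass on Pl.
target_pl = [0.0, 1.0]
v_z_after = [0.1, 0.9]        # hypothetical validation probe output on the intervened embedding

# A non-targeted property with two values, probed before and after.
v_other_before = [0.7, 0.3]   # hypothetical probe output on the original embedding
v_other_after = [0.6, 0.4]    # hypothetical probe output on the intervened embedding

c = completeness(v_z_after, target_pl)
s = selectivity(v_other_before, v_other_after)
```

With these made-up probe outputs, both scores come out at roughly 0.9: the intervention moved most of the targeted probe's mass to plural while shifting the non-targeted probe only slightly. A perfectly complete and perfectly selective intervention would score 1 on both.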