Towards Unifying Interpretability and Control: Evaluation via Intervention

Usha Bhalla, Suraj Srinivas, Asma Ghandeharioun, Himabindu Lakkaraju

TL;DR

This paper proposes an intervention-centric unifying framework that maps latent representations $x$ to interpretable features $z = xD$, enabling counterfactual edits $\hat{x}'$ to steer model outputs. It unifies four interpretability methods (sparse autoencoders, Logit Lens, Tuned Lens, and probing) within an encoder–decoder paradigm and introduces two metrics, Intervention Success Rate and coherence–intervention tradeoff, plus an open-ended prompt dataset for benchmarking. Across GPT2-small, Gemma2-2b, Llama2-7b, and Llama3-8b, lens-based methods show higher intervention efficacy for simple topics, but interventions often reduce output coherence and complex concepts remain challenging, with prompting sometimes outperforming interpretable interventions. The findings advocate for systematic benchmarking and suggest that current interpretability techniques, while informative for explanations, may have limited practical utility for reliable control in real-world applications, guiding future development toward more faithful and actionable interventions.

Abstract

With the growing complexity and capability of large language models, a need to understand model reasoning has emerged, often motivated by an underlying goal of controlling and aligning models. While numerous interpretability and steering methods have been proposed as solutions, they are typically designed either for understanding or for control, seldom addressing both. Additionally, the lack of standardized applications, motivations, and evaluation metrics makes it difficult to assess methods' practical utility and efficacy. To address these issues, we argue that intervention is a fundamental goal of interpretability and introduce success criteria to evaluate how well methods can control model behavior through interventions. To evaluate existing methods for this ability, we unify and extend four popular interpretability methods (sparse autoencoders, logit lens, tuned lens, and probing) into an abstract encoder-decoder framework, enabling interventions on interpretable features that can be mapped back to latent representations to control model outputs. We introduce two new evaluation metrics, intervention success rate and coherence-intervention tradeoff, designed to measure the accuracy of explanations and their utility in controlling model behavior. Our findings reveal that (1) while current methods allow for intervention, their effectiveness is inconsistent across features and models, (2) lens-based methods outperform SAEs and probes in achieving simple, concrete interventions, and (3) mechanistic interventions often compromise model coherence and underperform simpler alternatives such as prompting, highlighting a critical shortcoming of current interpretability approaches in applications requiring control.
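The encode, intervene, decode loop described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a purely linear encoder $z = xD$ and, as a stand-in decoder, the Moore-Penrose pseudo-inverse of $D$. The function names (`encode`, `intervene`, `decode`), the toy dimensions, and the random dictionary are all hypothetical; the decoders actually used for each method are not specified in this summary.

```python
import numpy as np


def encode(x, D):
    """Map a latent vector x (d_model,) to feature activations z = xD (n_features,)."""
    return x @ D


def intervene(z, feature_idx, value):
    """Return a copy of z with one interpretable feature set to a target value."""
    z_prime = z.copy()
    z_prime[feature_idx] = value
    return z_prime


def decode(z, D):
    """Map features back to latent space.

    Assumption: the pseudo-inverse of D serves as the decoder here purely
    for illustration; the methods in the paper may decode differently.
    """
    return z @ np.linalg.pinv(D)


# Toy setup (hypothetical sizes and random values, not taken from the paper).
rng = np.random.default_rng(0)
d_model, n_features = 8, 3
D = rng.normal(size=(d_model, n_features))  # dictionary / encoder matrix
x = rng.normal(size=d_model)                # a latent representation

z = encode(x, D)                                   # interpretable features
z_prime = intervene(z, feature_idx=1, value=5.0)   # edited features z'
x_hat_prime = decode(z_prime, D)                   # counterfactual latent x-hat'

# Sanity check: re-encoding the counterfactual latent recovers the edit,
# which holds here because D has full column rank.
z_check = encode(x_hat_prime, D)
```

In this toy setting the edit survives a round trip through the decoder because the dictionary has fewer features than latent dimensions. With an overcomplete dictionary, as is typical for sparse autoencoders, a pseudo-inverse decoder would only approximate the requested edit.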

Paper Structure

This paper contains 25 sections, 1 equation, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Our proposed intervention framework, which encodes model latent representations, $x$, into human-interpretable features, $z = xD$, that can then be perturbed to $z'$ and decoded back into counterfactual latent representations, $\hat{x}'$.
  • Figure 2: Evaluation of the Intervention Success Rate with respect to edit distance for each method on four models for the simple intervention topics. Note that normalized edit distance is a proxy for intervention strength that is comparable across methods. Logit Lens generally outperforms all other methods.
  • Figure 3: Intervened output coherence measured with respect to intervention success rate. The solid horizontal line shows the mean of coherence scores for the clean model outputs, and the dashed lines show $\pm$1 standard deviation around the mean.
  • Figure 4: Relationship between intervention success rate and coherence for three complex features: religious references (top), gendered language (middle), and French language (bottom) for Gemma2-2b (left) and Llama3-8b (right).
  • Figure 5: Examples of intervened model outputs for intervention feature 'yoga' at both the optimal intervention strength (left) and the maximum intervention strength tested (right). Outputs degrade into incoherent repetition at high intervention strength for all methods.
  • ...and 8 more figures