Towards Unifying Interpretability and Control: Evaluation via Intervention
Usha Bhalla, Suraj Srinivas, Asma Ghandeharioun, Himabindu Lakkaraju
TL;DR
This paper proposes an intervention-centric unifying framework that maps latent representations $x$ to interpretable features $z = xD$, so that edits to those features can be decoded into counterfactual latents $\hat{x}'$ that steer model outputs. It unifies four interpretability methods (sparse autoencoders, Logit Lens, Tuned Lens, and probing) within an encoder-decoder paradigm and introduces two metrics, Intervention Success Rate and the coherence-intervention tradeoff, plus an open-ended prompt dataset for benchmarking. Across GPT2-small, Gemma2-2b, Llama2-7b, and Llama3-8b, lens-based methods show higher intervention efficacy for simple topics, but interventions often reduce output coherence, complex concepts remain challenging, and prompting sometimes outperforms interpretable interventions. The findings argue for systematic benchmarking and suggest that current interpretability techniques, while informative as explanations, may have limited practical utility for reliable control in real-world applications. They point future work toward more faithful and actionable interventions.
Abstract
With the growing complexity and capability of large language models, a need to understand model reasoning has emerged, often motivated by an underlying goal of controlling and aligning models. While numerous interpretability and steering methods have been proposed as solutions, they are typically designed either for understanding or for control, seldom addressing both. Additionally, the lack of standardized applications, motivations, and evaluation metrics makes it difficult to assess methods' practical utility and efficacy. To address these issues, we argue that intervention is a fundamental goal of interpretability and introduce success criteria to evaluate how well methods can control model behavior through interventions. To evaluate existing methods for this ability, we unify and extend four popular interpretability methods (sparse autoencoders, logit lens, tuned lens, and probing) into an abstract encoder-decoder framework, enabling interventions on interpretable features that can be mapped back to latent representations to control model outputs. We introduce two new evaluation metrics, intervention success rate and coherence-intervention tradeoff, designed to measure the accuracy of explanations and their utility in controlling model behavior. Our findings reveal that (1) while current methods allow for intervention, their effectiveness is inconsistent across features and models, (2) lens-based methods outperform SAEs and probes in achieving simple, concrete interventions, and (3) mechanistic interventions often compromise model coherence, underperforming simpler alternatives such as prompting, which highlights a critical shortcoming of current interpretability approaches in applications requiring control.
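The encoder-decoder view described above can be illustrated with a small numerical sketch. It assumes a purely linear encoder $z = xD$, and it uses the Moore-Penrose pseudoinverse of $D$ as the decoder while carrying over the part of $x$ that the features do not explain. Both choices are assumptions made here for illustration, and the function names (`encode`, `decode`, `intervene`) are hypothetical. The paper's actual decoders for each method may differ.

```python
import numpy as np


def encode(x, D):
    """Map a latent vector x of shape (d,) to features z = x D, with D of shape (d, k)."""
    return x @ D


def decode(z, D_pinv):
    """Map features z of shape (k,) back to latent space using a (k, d) decoder matrix."""
    return z @ D_pinv


def intervene(x, D, feature, value):
    """Set one interpretable feature to `value` and return the edited latent.

    Assumption: the decoder is the pseudoinverse of D, and the reconstruction
    residual x - decode(encode(x)) is added back so that information not
    captured by the features is preserved.
    """
    D_pinv = np.linalg.pinv(D)           # shape (k, d)
    z = encode(x, D)
    residual = x - decode(z, D_pinv)     # part of x the features do not explain
    z_edit = z.copy()
    z_edit[feature] = value              # the intervention on feature space
    return decode(z_edit, D_pinv) + residual


# Toy example with random data (not taken from any real model).
rng = np.random.default_rng(0)
D_demo = rng.normal(size=(8, 3))         # d = 8 latent dims, k = 3 features
x_demo = rng.normal(size=8)
x_edited = intervene(x_demo, D_demo, feature=0, value=4.0)
print("features before:", encode(x_demo, D_demo))
print("features after: ", encode(x_edited, D_demo))
```

When the columns of $D$ are linearly independent, re-encoding the edited latent recovers the edited feature vector exactly, so only the targeted feature changes. Whether such an edit actually changes the model's output, and whether the output stays coherent, is what the two proposed metrics are meant to measure.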
