Training Language Models to Explain Their Own Computations

Belinda Z. Li, Zifan Carl Guo, Vincent Huang, Jacob Steinhardt, Jacob Andreas

TL;DR

The paper investigates whether language models can be trained to faithfully describe their own internal computations by exploiting privileged access to their internals. It introduces a framework that fine-tunes explainer LMs on ground truth from existing interpretability methods to describe three things: internal features, the outcomes of activation interventions, and input-based decision rules. Empirical results across these tasks indicate that self-explanation is data-efficient and faithful, with performance improving when the explainer is matched to the target model and when activations are similar. The work proposes introspective interpretability as a scalable complement to existing interpretability tools and discusses broader implications for alignment and faithfulness.

Abstract

Can language models (LMs) learn to faithfully describe their internal computations? Are they better able to describe themselves than other models? We study the extent to which LMs' privileged access to their own internals can be leveraged to produce new techniques for explaining their behavior. Using existing interpretability techniques as a source of ground truth, we fine-tune LMs to generate natural language descriptions of (1) the information encoded by LM features, (2) the causal structure of LMs' internal activations, and (3) the influence of specific input tokens on LM outputs. When trained with only tens of thousands of example explanations, explainer models exhibit non-trivial generalization to new queries. This generalization appears partly attributable to explainer models' privileged access to their own internals: using a model to explain its own computations generally works better than using a *different* model to explain its computations (even if the other model is significantly more capable). Our results suggest not only that LMs can learn to reliably explain their internal computations, but that such explanations offer a scalable complement to existing interpretability methods.

Paper Structure

This paper contains 75 sections, 8 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Overview of our methods for training models to describe their own internal procedures. As a first step, we extract answers to three types of question about the target model using different interpretability methods: (A) descriptions of features of the target model's hidden representations via an auto-interp procedure, (B) explanations about what parts of the model affect the output via activation patching, (C) explanations about what parts of the input affect the output via input ablation. As a second step, we fine-tune three explainers to produce each of the three types of explanations. (D) After fine-tuning various explainer models on various target models, we find evidence of privileged access for all three explanation types, and use this to obtain data-efficient explainers.
  • Figure 2: Training an explainer to predict feature descriptions. (A) We run a target model $\mathcal{M}$ across many input contexts. (B) We then observe vectors $v$ at layers $\ell$ and select an explanation $E$ that best describes the contexts in which $v$ is active, which we assess via an LM simulator. (C) Finally, we train an explainer model to respond with $E$ when given the question $q$ about what $v$ means at layer $\ell$, and find that it performs well on both held-out and OOD vectors $v$.
  • Figure 3: Training an explainer to predict activation patching outcomes. (A) We run a target model $\mathcal{M}$ on an input $x$ and obtain a prediction $\mathcal{M}(x)$. (B) We perform activation patching by running $\mathcal{M}$ on a counterfactual input $x'$ (in this example, patching in vector $v$ from token $x_1$ and layer $\ell_1$ of the counterfactual run into the original run) and then construct an explanation $E$ about how the resulting prediction changes. (C) Finally, we train an explainer model to respond with $E$ when given a question $q$ about the patching procedure.
  • Figure 4: Training an explainer to predict input ablation outcomes. (A) We run a target model $\mathcal{M}$ on an input $x$. (B) We then identify whether a part of the input $\tilde{x} \subset x$ was important to the output by rerunning $\mathcal{M}$ on $x \setminus \tilde{x}$. (C) Finally, we train an explainer model to answer question $q$ about how the output changes when $\tilde{x}$ is removed from $x$.
  • Figure 5: Training self-as-explainer is more data-efficient than alternatives. We plot scaling curves of each explainer: the number of training samples per layer is plotted against explanation quality, as judged by an LM judge on held-out SAE features. We find that matching the explainer to the target (Llama-3.1-8B) is generally the most data-efficient, while using another model as explainer (Qwen) or any form of nearest neighbor is much less effective in low-data regimes, taking over 1k-10k samples per layer to beat an untrained SelfIE baseline.
  • ...and 3 more figures
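
The activation patching and input ablation procedures in the captions for Figures 3 and 4 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy model (`run`, `EMBED`, `LABELS`), the helper names `patching_pair` and `ablation_pair`, and the wording of the question and explanation strings are hypothetical and are not taken from the paper, which applies these procedures to real LMs. The sketch only shows how a (question $q$, explanation $E$) training pair might be built by comparing a base run against an intervened run.

```python
from typing import Dict, List, Optional, Set, Tuple

Vec = Tuple[float, float]
Patch = Tuple[int, int, Vec]  # (layer, position, replacement vector)

# Hypothetical toy target model, NOT the paper's setup: every token has a fixed
# 2-d embedding (layer 0), layer 1 doubles it, and the logits are the sum over
# positions plus a constant logit for the fallback label "unknown".
EMBED: Dict[str, Vec] = {"Paris": (2.0, 0.0), "Rome": (0.0, 2.0),
                         "is": (0.1, 0.1), "in": (0.1, 0.1)}
LABELS = ("France", "Italy", "unknown")


def run(tokens: List[str], patch: Optional[Patch] = None):
    """Run the toy model; return (prediction, cache of hidden vectors)."""
    cache: Dict[Tuple[int, int], Vec] = {}
    totals = [0.0, 0.0]
    for pos, tok in enumerate(tokens):
        h = EMBED[tok]
        for layer in (0, 1):
            if layer == 1:
                h = (2 * h[0], 2 * h[1])
            if patch is not None and patch[:2] == (layer, pos):
                h = patch[2]  # overwrite this activation with the patched vector
            cache[(layer, pos)] = h
        totals[0] += h[0]
        totals[1] += h[1]
    logits = (totals[0], totals[1], 1.0)
    pred = LABELS[max(range(len(LABELS)), key=lambda i: logits[i])]
    return pred, cache


def describe(base: str, new: str) -> str:
    """Turn a before/after prediction pair into a short explanation string."""
    if new == base:
        return f"The output is unchanged: {base}."
    return f"The output changes from {base} to {new}."


def patching_pair(x: List[str], x_cf: List[str], layer: int, pos: int):
    """Activation patching: copy one activation from the counterfactual run
    on x_cf into the run on x (x_cf must have a token at index pos)."""
    base, _ = run(x)
    _, cf_cache = run(x_cf)
    patched, _ = run(x, patch=(layer, pos, cf_cache[(layer, pos)]))
    question = (f"What happens to the output on {x} if layer {layer}, "
                f"position {pos} is patched in from {x_cf}?")
    return question, describe(base, patched)


def ablation_pair(x: List[str], removed: Set[int]):
    """Input ablation: rerun the model with the given token positions removed."""
    base, _ = run(x)
    kept = [tok for i, tok in enumerate(x) if i not in removed]
    ablated, _ = run(kept)
    question = f"What happens to the output on {x} if positions {sorted(removed)} are removed?"
    return question, describe(base, ablated)


if __name__ == "__main__":
    x, x_cf = ["Paris", "is", "in"], ["Rome", "is", "in"]
    print(patching_pair(x, x_cf, layer=1, pos=0))
    print(ablation_pair(x, {0}))
```

In this toy setting, patching or removing the position that holds "Paris" changes the prediction, while intervening on "is" leaves it unchanged. Pairs of this kind, collected over many inputs and interventions, are the sort of supervision the captions describe fine-tuning an explainer on.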