Do Natural Language Descriptions of Model Activations Convey Privileged Information?

Millicent Li, Alberto Mario Ceballos Arroyo, Giordano Rogers, Naomi Saphra, Byron C. Wallace

TL;DR

This paper asks whether natural language descriptions of LLM activations (activation verbalization) truly reveal privileged internal knowledge or merely mirror input information. Through controlled experiments with Patchscopes and LatentQA-style verbalizers, inversion-based reconstructions, and the novel PersonaQA benchmarks, the authors show that many verbalization evaluations do not require access to target activations and can be solved using the input alone or the verbalizer's own knowledge. They demonstrate that verbalizers often reflect their own world knowledge, especially when the knowledge of the target model (M1) and the verbalizer (M2) is misaligned, and that inversion can recover input prompts with high fidelity, sometimes matching verbalization performance. The work argues for targeted benchmarks and rigorous controls to assess whether verbalization yields meaningful, privileged insights into LLM operations, and it highlights the limitations of current datasets and evaluation designs for interpretability research.

Abstract

Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about its inputs? We critically evaluate popular verbalization methods across datasets used in prior work and find that they can succeed at benchmarks without any access to target model internals, suggesting that these datasets may not be ideal for evaluating verbalization methods. We then run controlled experiments which reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM which generated them, rather than the knowledge of the target LLM whose activations are decoded. Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs.

Paper Structure

This paper contains 86 sections, 6 figures, 32 tables.

Figures (6)

  • Figure 1: Two ways that a verbalizer ($\mathcal{M}_2$) might describe an activation. In our preferred scenario (a), the description employs privileged information beyond what is accessible in the input ($x_{\textrm{input}}$), so the country of origin for Alice can be determined from the target ($\mathcal{M}_1$) model's activations. Alternatively, (b) verbalization may give no privileged insights into the operations of $\mathcal{M}_1$ since $\mathcal{M}_2$ may only be accessing input text information from $\mathcal{M}_1$, and so $\mathcal{M}_2$ can only answer based on its own knowledge about Alice.
  • Figure 2: Two ways of verbalizing descriptions of model activations. In (a), Patchscopes (Ghandeharioun et al., 2024) and SelfIE (Chen et al., 2024) both patch the last token representation from target model $\mathcal{M}_1$ into the interpretation prompt and use $\mathcal{M}_2$ to verbalize this activation. In (b), LIT (Pan et al., 2024) patches an activation matrix from a layer ($N$ tokens) of $\mathcal{M}_1$ into $\mathcal{M}_2$.
  • Figure 3: We use the following setup to assess whether verbalization techniques communicate privileged information or merely describe input texts. (a) An activation from target model $\mathcal{M}_1$ is directly inverted with $\mathcal{M}_{\text{rec}}$, a separate model trained to do this. (b) We pass this (possibly imperfect) reconstruction $x_{\text{rec}}$ together with $x_{\text{prompt}}$ to $\mathcal{M}_2$ to make a prediction, without access to $\mathcal{M}_1$ activations. Finally, (c) we obtain the output from $\mathcal{M}_2$, a zero-shot judgment based on the inverted input and the prompt combined. Note that $\mathcal{M}_2$ is in this case an instruction-tuned model not trained on activations (though here, when paired with $\mathcal{M}_{\text{rec}}$, we use the notation interchangeably).
  • Figure 4: We show the effect of using an $x_{\text{prompt}}$ that is semantically similar or adversarial. We average across all tasks and tested prompts for space; see the appendix subsection on verbalization prompts for the full prompt and task breakdown.
  • Figure 5: We show the effects of small prompt manipulations. For both LIT and Patchscopes, we verbalize activations from layer $\ell = 15$. The four chosen prompts are semantically similar, yet they yield significant gaps in performance. This holds even for LIT, where the verbalizer is trained and the additional finetuning would be expected to make it less sensitive to such differences.
  • ...and 1 more figure
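
The input-only control described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `invert` and `answer_from_text` are hypothetical callables standing in for $\mathcal{M}_{\text{rec}}$ and a zero-shot $\mathcal{M}_2$, the toy stubs replace real models, and case-insensitive exact-match scoring is assumed for simplicity.

```python
from typing import Callable, Sequence


def input_only_accuracy(
    activations: Sequence[object],
    prompts: Sequence[str],
    answers: Sequence[str],
    invert: Callable[[object], str],              # hypothetical stand-in for M_rec
    answer_from_text: Callable[[str, str], str],  # hypothetical stand-in for zero-shot M_2
) -> float:
    """Accuracy of a baseline that only ever sees reconstructed input text."""
    correct = 0
    for act, x_prompt, gold in zip(activations, prompts, answers):
        x_rec = invert(act)                       # (a) activation -> reconstructed input
        pred = answer_from_text(x_rec, x_prompt)  # (b), (c) no access to M_1 activations
        # Assumed scoring rule: case-insensitive exact match.
        correct += int(pred.strip().lower() == gold.strip().lower())
    return correct / len(answers) if answers else 0.0


# Toy stubs so the sketch runs end to end; real models would replace these.
def toy_invert(act: str) -> str:
    return act[len("act:"):]  # pretend the activation perfectly encodes its input


def toy_answer_from_text(x_rec: str, x_prompt: str) -> str:
    return x_rec.split()[-1]  # pretend M_2 reads the answer off the text


toy_activations = ["act:Alice lives in Paris", "act:Bob lives in Rome"]
toy_prompts = ["Which city is mentioned?"] * 2
score = input_only_accuracy(
    toy_activations, toy_prompts, ["Paris", "Rome"], toy_invert, toy_answer_from_text
)
```

If a baseline of this kind scores close to a verbalizer that does read $\mathcal{M}_1$ activations, the benchmark in question cannot distinguish privileged information from information already present in the input.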