Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs

Keenan Pepper; Alex McKenzie; Florin Pop; Stijn Servaes; Martin Leitgab; Mike Vaiana; Judd Rosenblatt; Michael S. A. Graziano; Diogo de Lucena

TL;DR

Prompting language models to describe their own internals yields unreliable self-interpretations. This work addresses the problem by training lightweight adapters on interpretability artifacts while keeping the base model frozen. Using Patchscopes-style activation patching, adapters map activation vectors to token embeddings, with the scalar affine architecture ($d_\text{model}+1$ parameters) delivering most of the gains; full-rank adapters overfit on SAE data, while low-rank extensions offer measurable improvements. Across diverse data sources (SAE features and Wikipedia contrastive vectors) and model families (Llama, Gemma, Qwen), trained adapters yield reliable self-interpretations, improve with model size, and enable decoding of implicit reasoning (e.g., bridge entities) without chain-of-thought. The approach demonstrates strong cross-dataset and cross-model generalization, preserves the original model without fine-tuning, and provides a practical path toward verifiable self-interpretation and model auditing at scale.

Abstract

Self-interpretation methods prompt language models to describe their own internal states, but remain unreliable due to hyperparameter sensitivity. We show that training lightweight adapters on interpretability artifacts, while keeping the LM entirely frozen, yields reliable self-interpretation across tasks and model families. A scalar affine adapter with just $d_\text{model}+1$ parameters suffices: trained adapters generate sparse autoencoder feature labels that outperform the training labels themselves (71% vs 63% generation scoring at 70B scale), identify topics with 94% recall@1 versus 1% for untrained baselines, and decode bridge entities in multi-hop reasoning that appear in neither prompt nor response, surfacing implicit reasoning without chain-of-thought. The learned bias vector alone accounts for 85% of the improvement, and simpler adapters generalize better than more expressive alternatives. Controlling for model knowledge via prompted descriptions, we find self-interpretation gains outpace capability gains from 7B to 72B parameters. Our results demonstrate that self-interpretation improves with scale, without modifying the model being interpreted.
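To make the $d_\text{model}+1$ parameter count concrete, here is a minimal sketch of a scalar affine adapter. It assumes the adapter has the form $f(h) = s \cdot h + b$, with one learned scalar $s$ and one learned bias vector $b \in \mathbb{R}^{d_\text{model}}$; this form is inferred from the parameter count and the mention of a learned bias vector, not from the authors' code. The class name `ScalarAffineAdapter` and its methods are hypothetical, and training and patching into the frozen model are omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScalarAffineAdapter:
    """Illustrative adapter f(h) = scale * h + bias (assumed form, not the authors' code)."""

    scale: float          # one scalar parameter
    bias: List[float]     # d_model bias parameters

    def __call__(self, h: List[float]) -> List[float]:
        # Map an activation vector to the vector that would be patched
        # into the frozen model at the placeholder position.
        if len(h) != len(self.bias):
            raise ValueError("activation and bias must both have d_model entries")
        return [self.scale * x + b for x, b in zip(h, self.bias)]

    def num_parameters(self) -> int:
        # One scale plus d_model bias entries gives d_model + 1.
        return 1 + len(self.bias)


# Toy usage with d_model = 3 (real models have thousands of dimensions).
adapter = ScalarAffineAdapter(scale=2.0, bias=[0.5, -1.0, 0.0])
patched = adapter([1.0, 2.0, 3.0])
```

Under this assumed form, setting `scale` to zero leaves only the bias term, which is one way to read the claim that the bias vector alone carries most of the improvement.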

Paper Structure (77 sections, 17 figures, 17 tables)

This paper contains 77 sections, 17 figures, 17 tables.

Introduction
Methods
Background: Self-Interpretation via Patching
Trained Adapters
Training Data: Interpretability Artifacts
Training Objective
Experiments
Setup
Models.
Datasets.
Evaluation.
Contrastive Activation Vector Results
Qualitative example.
Self-Interpretation Scales with Model Size
SAE Evaluation
...and 62 more sections

Figures (17)

Figure 1: Training self-interpretation from interpretability artifacts. In this case, the "interpretability artifact" consists of $(h, y)$ pairs where the vector $h$ is a contrastive activation vector from the source prompt about a specific topic, and the label $y$ is one of several synthetic descriptions of that topic. The activation $h$ is extracted from layer $\ell$ at the final token position of the source prompt (the \n\n following the chat template's assistant header). A lightweight adapter transforms $h$ and injects it at layer 0 at the placeholder position of an explanation-seeking target prompt. Cross-entropy loss on the label tokens trains only the $d_{\text{model}}{+}1$ adapter parameters; the language model remains frozen. The case of training on an SAE dataset is similar except that $h$ is an SAE decoder vector and $y$ is a natural language feature label, e.g. from automated interpretability.
Figure 2: Scaling comparison on Qwen-2.5 models (7B to 72B). Trained SelfIE (below): recall@100 on held-out topics for full-rank adapters trained on contrastive topic vectors from the middle half of each model's layers. Taboo baseline (above): the model describes each topic without naming it, scored with the same embedding retrieval. While SelfIE consistently performs below the Taboo ceiling, the gap decreases with model scale as SelfIE's performance increases more rapidly. Error bars show 95% confidence intervals. See Appendix \ref{app:scaling} for additional metrics.
Figure 3: Bridge entity detection across layers and token positions. Each cell shows the fraction of SelfIE generations (temperature 0.7, 10 samples per cell) containing any alias of the bridge entity (e.g., "Plato" for the prompt "The author of The Republic was born in the city of"). Position 0 is aligned to the first token where detection exceeds 0.1%; negative positions are earlier context. Top: Untrained SelfIE shows weak, localized signal. Bottom: Trained adapter (scalar affine) produces stronger detection rates over a broader range of layers and positions. Aggregated over 500 randomly sampled TwoHopFact prompts where the language model answers both two-hop and first-hop questions correctly when instructed to answer immediately with no CoT.
Figure 4: Histograms showing the distribution of the number of scales (out of 6 scales attempted) at which each method produced accurate labels (where "accurate" is defined as eliciting at least one nonzero activation in 10 trials of generation scoring). The trained adapter is less sensitive to scale, with more latents receiving accurate labels at all 6 scales.
Figure 5: Validation loss curves during training on Llama Scope SAE features. Full-rank adapters (a) achieve lower training loss but higher validation loss than scalar affine + low-rank adapters (b), demonstrating overfitting. The training-loss gap appears only after the first epoch, yet the full-rank adapter already underperforms on validation at the end of the first epoch (validation loss 1.691 versus 1.661).
...and 12 more figures
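
The recall@1 and recall@100 figures quoted in the abstract and in Figure 2 come from embedding retrieval. As a rough illustration of how such a metric can be computed, the sketch below ranks candidate topic embeddings by cosine similarity to the embedding of each generated description and checks whether the true topic lands in the top $k$. The choice of cosine similarity, the function names `cosine` and `recall_at_k`, and the toy vectors are all assumptions for illustration; the paper's exact retrieval setup is not described in this section.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recall_at_k(
    query_embs: List[Sequence[float]],
    topic_embs: List[Sequence[float]],
    true_topics: List[int],
    k: int,
) -> float:
    """Fraction of queries whose true topic index is among the k nearest topics."""
    hits = 0
    for query, target in zip(query_embs, true_topics):
        # Rank all candidate topics from most to least similar to this query.
        ranked = sorted(
            range(len(topic_embs)),
            key=lambda i: cosine(query, topic_embs[i]),
            reverse=True,
        )
        if target in ranked[:k]:
            hits += 1
    return hits / len(query_embs)


# Toy data: three topics, three generated descriptions (hypothetical values).
topics = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
labels = [0, 1, 2]

r1 = recall_at_k(queries, topics, labels, k=1)
r2 = recall_at_k(queries, topics, labels, k=2)
```

In this toy case the third description sits closer to topic 0 than to its true topic 2, so it counts as a miss at $k=1$ and a hit at $k=2$.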
