LatentQA: Teaching LLMs to Decode Activations Into Natural Language

Alexander Pan, Lijie Chen, Jacob Steinhardt

TL;DR

The paper addresses the opacity of latent representations in large language models by introducing LatentQA, the task of answering open-ended natural-language questions about model activations. It proposes Latent Interpretation Tuning (LIT), which finetunes a decoder LLM on activations paired with question-answer pairs, enabling both reading of latent information and control of model behavior. A three-part LatentQA data pipeline (control, stimulus, stimulus+completion), combined with activation masking and data augmentation, helps the decoder generalize; the paper applies the decoder to debiasing, controllable sentiment generation, and auditing harmful capabilities. Scaling experiments show that performance improves with larger models and more data, which suggests the approach could support interpretability and safer, more controllable LLM deployment.

Abstract

Interpretability methods seek to understand language model representations, yet the outputs of most such methods -- circuits, vectors, scalars -- are not immediately human-interpretable. In response, we introduce LatentQA, the task of answering open-ended questions about model activations in natural language. Towards solving LatentQA, we propose Latent Interpretation Tuning (LIT), which finetunes a decoder LLM on a dataset of activations and associated question-answer pairs, similar to how visual instruction tuning trains on question-answer pairs associated with images. We use the decoder for diverse reading applications, such as extracting relational knowledge from representations or uncovering system prompts governing model behavior. Our decoder also specifies a differentiable loss that we use to control models, such as debiasing models on stereotyped sentences and controlling the sentiment of generations. Finally, we extend LatentQA to reveal harmful model capabilities, such as generating recipes for bioweapons and code for hacking.
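
As a reading aid, the two losses the abstract describes can be written out as follows. This is a sketch in notation that is not taken from the paper: $h$ stands for the activations captured from the target LLM, $(q, a)$ for a question-answer pair with answer tokens $a_1, \dots, a_T$, $\theta$ for the decoder parameters, and $\phi$ for the target LLM parameters. All of these symbols are assumptions made for illustration.

```latex
% Reading (LIT training): finetune the decoder \theta to answer
% questions about patched-in activations h (assumed notation).
\mathcal{L}_{\mathrm{read}}(\theta)
  = -\sum_{t=1}^{T} \log p_{\theta}\!\left(a_t \mid h,\, q,\, a_{<t}\right)

% Control: hold the decoder \theta fixed and treat the same
% cross-entropy, for a desired pair (q^*, a^*), as a loss on the
% target parameters \phi through the activations h_{\phi}(x).
\mathcal{L}_{\mathrm{control}}(\phi)
  = -\sum_{t=1}^{T} \log p_{\theta}\!\left(a^{*}_t \mid h_{\phi}(x),\, q^{*},\, a^{*}_{<t}\right)
```

Under this reading, the second loss is differentiable in $\phi$ because the activations $h_{\phi}(x)$ are a differentiable function of the target model's parameters, which is what would let the decoder steer the target.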

Paper Structure

This paper contains 63 sections, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Curating and training on LatentQA data. (1): We curate LatentQA data by prompting the target LLM with a control prepended to a stimulus and capture activations from the stimulus. We also ask GPT to generate QA pairs about the control. (2): We train our decoder LLM, a copy of the target LLM, by patching in activations from the stimulus and finetuning the decoder to minimize the cross-entropy loss on the QA pairs.
  • Figure 2: Our LatentQA data generation pipeline. (1): Given a category of controls, we prompt OpenAI's o1-preview to generate seed controls in that category. (2): Given a seed control, we ask o1 to generate a synthetic control, stimulus, and completion. We use o1 as we find that it is better able to follow the control than the target LLM. (3): We ask o1 to generate description-based and reasoning-based QA pairs about the control.
  • Figure 3: The LatentQA data used in LIT. The top block shows an example control, stimulus, and completion. The bottom block shows the three types of LatentQA data generated from the example.
  • Figure 4: Reading with LatentQA. We can read model activations on the current user prompt (in green) to predict properties of future model completions, e.g., learning about the model's persona.
  • Figure 5: Control with LatentQA. Given an [Act] and a control specified as a QA pair, the decoder provides a gradient (in red) to the target LLM, altering its responses, e.g., causing it to choose a vegan dish.
  • ...and 10 more figures
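
The captions for Figures 1 and 3 can be made concrete with a small sketch of how one example might expand into the three data types named in the TL;DR (control, stimulus, stimulus+completion). The class `LatentQAExample`, the functions `build_variants` and `readable_tokens`, and the specific masking rule (hiding the control's positions whenever it is prepended to a stimulus) are all hypothetical: they illustrate one plausible reading of "activation masking", not the paper's implementation. Tokens are plain strings so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class LatentQAExample:
    """One curated example (hypothetical structure, for illustration only)."""
    control: List[str]      # e.g. a persona or instruction given to the target LLM
    stimulus: List[str]     # the user prompt
    completion: List[str]   # the model's response
    qa_pairs: List[Tuple[str, str]]  # QA pairs written about the control


def build_variants(ex: LatentQAExample) -> Dict[str, Tuple[List[str], List[bool]]]:
    """Return, per variant, the prompt tokens and a mask of readable positions.

    mask[i] is True where the decoder may read the activation at position i.
    ASSUMPTION: the control is masked out whenever it precedes a stimulus, so
    the decoder must infer the control from downstream activations.
    """
    n_c, n_s, n_k = len(ex.control), len(ex.stimulus), len(ex.completion)
    return {
        # Variant 1: read activations of the control itself.
        "control": (list(ex.control), [True] * n_c),
        # Variant 2: control prepended, but only stimulus positions are read.
        "stimulus": (ex.control + ex.stimulus, [False] * n_c + [True] * n_s),
        # Variant 3: stimulus and completion positions are read.
        "stimulus_completion": (
            ex.control + ex.stimulus + ex.completion,
            [False] * n_c + [True] * (n_s + n_k),
        ),
    }


def readable_tokens(tokens: List[str], mask: List[bool]) -> List[str]:
    """Tokens whose activations would be patched into the decoder."""
    return [tok for tok, keep in zip(tokens, mask) if keep]


if __name__ == "__main__":
    example = LatentQAExample(
        control=["Be", "vegan"],
        stimulus=["What", "dinner?"],
        completion=["Try", "tofu"],
        qa_pairs=[("What diet does the assistant follow?", "A vegan diet.")],
    )
    for name, (tokens, mask) in build_variants(example).items():
        print(name, readable_tokens(tokens, mask))
```

In a real pipeline the masks would select positions in the target LLM's activations rather than strings, and every variant would be paired with the same QA pairs about the control for the cross-entropy training described in Figure 1.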