ParaScopes: What do Language Models Activations Encode About Future Text?

Nicky Pochinkov, Yulia Volkova, Anna Vasileva, Sai V R Chereddy

TL;DR

This work tests whether language models encode forward-looking information in their activations by formalizing a Planning Decodability Hypothesis and introducing Residual Stream Decoders (ParaScopes) to extract future content from a fixed residual stream $R_i\in\mathbb{R}^{L\times d}$. It presents two decoding approaches, Continuation ParaScope and Text AutoEncoder ParaScope (TAE ParaScope), plus an outline variant, all evaluated on Llama-3.2-3B-Instruct with a large synthetic paragraph dataset using cosine similarity, BLEURT, and LLM-based judgments. The results show decodable information about upcoming paragraphs, roughly comparable to $5$ tokens of lookahead in small models, with stronger subject retention for TAE and stronger detail retention for Continuation in certain cases; outline-level decoding is weaker. Together, the findings support the decodability of paragraph-scale planning and offer a framework for monitoring longer-horizon information in LLMs, pointing to middle-layer dynamics and boundary-driven planning as key factors.

Abstract

Interpretability studies of language models often investigate forward-looking representations in activations. However, as language models become capable of tasks with ever longer time horizons, methods for understanding activations often remain limited to testing specific concepts or tokens. We develop a framework of Residual Stream Decoders as a method of probing model activations for paragraph-scale and document-scale plans. We test several methods and find that the information that can be decoded is equivalent to 5+ tokens of future context in small models. These results lay the groundwork for better monitoring of language models and a better understanding of how they might encode longer-term planning information.
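As a rough illustration of the kind of pipeline the summary describes (map a residual-stream vector to a target text-embedding vector, then score the prediction with cosine similarity), here is a minimal toy sketch. Everything specific in it is an assumption made for illustration: the map is a ridge-regularised linear fit, the data is synthetic, and the names `fit_linear_map` and `cosine_similarity` and all dimensions are hypothetical. It is not the authors' implementation, and its output says nothing about the paper's results.

```python
import numpy as np

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (n, d) arrays."""
    a_norm = np.linalg.norm(a, axis=1)
    b_norm = np.linalg.norm(b, axis=1)
    return np.sum(a * b, axis=1) / (a_norm * b_norm + 1e-12)

def fit_linear_map(residuals, targets, ridge=1e-3):
    """Ridge-regularised least-squares map W with residuals @ W ~= targets."""
    d_in = residuals.shape[1]
    gram = residuals.T @ residuals + ridge * np.eye(d_in)
    return np.linalg.solve(gram, residuals.T @ targets)

# Synthetic stand-ins (assumption): real inputs would be residual-stream
# activations and text-autoencoder embeddings of the following paragraph.
rng = np.random.default_rng(0)
n, d_resid, d_embed = 512, 64, 16  # toy sizes, not the paper's

true_map = rng.normal(size=(d_resid, d_embed))
residuals = rng.normal(size=(n, d_resid))
targets = residuals @ true_map + 0.1 * rng.normal(size=(n, d_embed))

# Fit on a training split, then evaluate on held-out rows.
train, test = slice(0, 400), slice(400, n)
W = fit_linear_map(residuals[train], targets[train])
predicted = residuals[test] @ W
scores = cosine_similarity(predicted, targets[test])
mean_score = float(scores.mean())
print(f"mean cosine similarity on held-out toy data: {mean_score:.3f}")
```

The linear map is chosen only because it is the simplest thing that can be fitted in closed form; a learned nonlinear map would slot into the same fit-then-score structure.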

Paper Structure

This paper contains 49 sections, 14 figures, and 12 tables.

Figures (14)

  • Figure 1: Simple diagram showing the idea behind ParaScopes. The residual stream of an LLM is taken at a specific point, and we try to use ParaScope methods to infer what the LLM might say next.
  • Figure 2: Continuation ParaScope (left) and TAE ParaScope (right). The former takes the whole residual stream of the model and passes it into a blank-context copy of the model for decoding. The latter takes the residual stream of the model and trains a map to output a text autoencoder vector.
  • Figure 3: Basic diagram explaining the next-paragraph prediction task (left) and showing how we produce the baseline generation and cheat-k predictions (right).
  • Figure 4: Violin plots showing the performance of TAE ParaScope and Continuation ParaScope against the baselines (0, 1, 5, and 10 cheat tokens) and ground truth (regenerated, auto-decoded) on cosine similarity (left) and BLEURT (right).
  • Figure 5: Cumulative bars showing the performance of TAE ParaScope and Continuation ParaScope against the baselines (0, 1, 5, and 10 cheat tokens) and ground truth (regenerated, auto-decoded). (left) shows subject match to original paragraph on a scale from -1 to 4, and (right) shows detail preservation on a scale from -1 to 3.
  • ...and 9 more figures