ParaScopes: What do Language Models Activations Encode About Future Text?
Nicky Pochinkov, Yulia Volkova, Anna Vasileva, Sai V R Chereddy
TL;DR
This work tests whether language models encode forward-looking information in their activations. It formalizes a Planning Decodability Hypothesis and introduces Residual Stream Decoders (ParaScopes), which extract future content from a fixed residual stream $R_i\in\mathbb{R}^{L\times d}$. Two decoding approaches, Continuation ParaScope and Text AutoEncoder ParaScope (TAE ParaScope), plus an outline variant, are evaluated on Llama-3.2-3B-Instruct with a large synthetic paragraph dataset, using cosine similarity, BLEURT, and LLM-based judgments. The results show decodable information about upcoming paragraphs, comparable to about 5 tokens of lookahead in small models, with stronger subject retention for TAE and stronger detail retention for Continuation in certain cases; outline-level decoding is weaker. Together, the findings support the decodability of paragraph-scale planning and offer a framework for monitoring longer-horizon information in LLMs, pointing to middle-layer dynamics and boundary-driven planning as key factors.
Abstract
Interpretability studies of language models often investigate forward-looking representations in activations. However, as language models become capable of ever longer-horizon tasks, methods for understanding activations often remain limited to testing specific concepts or tokens. We develop a framework of Residual Stream Decoders as a method of probing model activations for paragraph-scale and document-scale plans. We test several methods and find that information equivalent to 5+ tokens of future context can be decoded in small models. These results lay the groundwork for better monitoring of language models and a better understanding of how they might encode longer-term planning information.
