Do language models plan ahead for future tokens?
Wilson Wu, John X. Morris, Lionel Levine
TL;DR
The paper asks whether language models plan ahead by computing, at time step $t$, information that mainly benefits future tokens, and formalizes two explanations: pre-caching and breadcrumbs. To tell them apart, it introduces myopic training, which blocks gradient flow to past timesteps. On synthetic data the authors find clear evidence of pre-caching; in natural-language experiments with GPT-2 the results are more consistent with breadcrumbs, with pre-caching increasing as models scale up. These insights have implications for interpretability and safety, and may point to ways of controlling or leveraging future-token computation in practice.
Abstract
Do transformers "think ahead" during inference at a given position? It is known that transformers prepare information in the hidden states of the forward pass at time step $t$ that is then used in future forward passes at $t+\tau$. We posit two explanations for this phenomenon: pre-caching, in which off-diagonal gradient terms present during training result in the model computing features at $t$ that are irrelevant to the present inference task but useful for the future, and breadcrumbs, in which the features most relevant to time step $t$ are already the same as those that would most benefit inference at time $t+\tau$. We test these hypotheses by training language models without propagating gradients to past timesteps, a scheme we formalize as myopic training. In a constructed synthetic data setting, we find clear evidence for pre-caching. In the autoregressive language modeling setting, our experiments are more suggestive of the breadcrumbs hypothesis, though pre-caching increases with model scale.
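
To make the distinction between diagonal and off-diagonal gradient terms concrete, the following is a minimal sketch on a hypothetical scalar toy model, not the architecture or training code used in the paper. It assumes a hidden state $h_t = w x_t$, a prediction $p_t = u h_t + v h_{t-1}$ that reads the previous hidden state, and a squared-error loss; the names `forward`, `loss`, and `grad_w` are illustrative. Ordinary training differentiates through $h_{t-1}$, whereas the myopic variant treats past hidden states as constants, so only the diagonal terms remain.

```python
# Hypothetical toy model (an assumption for illustration only):
#   hidden state:  h_t = w * x_t
#   prediction:    p_t = u * h_t + v * h_{t-1}   (h before the first position is 0)
#   loss:          L   = sum_t (p_t - y_t)^2


def forward(w, u, v, xs):
    """Return hidden states and predictions for every position."""
    hs = [w * x for x in xs]
    ps = []
    for t, h in enumerate(hs):
        h_prev = hs[t - 1] if t > 0 else 0.0
        ps.append(u * h + v * h_prev)
    return hs, ps


def loss(w, u, v, xs, ys):
    """Total squared-error loss over all positions."""
    _, ps = forward(w, u, v, xs)
    return sum((p - y) ** 2 for p, y in zip(ps, ys))


def grad_w(w, u, v, xs, ys, myopic=False):
    """Gradient of the loss with respect to w.

    With myopic=False this is the ordinary gradient. With myopic=True the
    past hidden state h_{t-1} is treated as a constant when differentiating
    the loss at position t, which drops the off-diagonal terms.
    """
    _, ps = forward(w, u, v, xs)
    g = 0.0
    for t, (p, y) in enumerate(zip(ps, ys)):
        residual = 2.0 * (p - y)
        # Diagonal term: loss at position t through its own hidden state h_t.
        g += residual * u * xs[t]
        # Off-diagonal term: loss at position t through the earlier state h_{t-1}.
        if not myopic and t > 0:
            g += residual * v * xs[t - 1]
    return g


xs = [1.0, 2.0, 3.0]
ys = [0.5, -1.0, 2.0]
w, u, v = 0.7, 1.3, -0.4

full = grad_w(w, u, v, xs, ys)
myopic = grad_w(w, u, v, xs, ys, myopic=True)
print("full gradient:  ", full)
print("myopic gradient:", myopic)
```

In this toy, the gap between the two gradients is exactly the signal that would push $w$ to make $h_{t-1}$ more useful for a later position, which is the mechanism the abstract associates with pre-caching. In an autodiff framework the same effect is typically obtained by applying a stop-gradient to hidden states read from earlier positions, though the details of the paper's implementation are not given here.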
