Are language models aware of the road not taken? Token-level uncertainty and hidden state dynamics

Amir Zur, Atticus Geiger, Ekdeep Singh Lubana, Eric Bigelow

TL;DR

This paper investigates token-level uncertainty and hidden-state dynamics during chain-of-thought reasoning. It uses Forking Paths Analysis to quantify per-token outcome uncertainty and reports a correlation between a token's uncertainty and its steerability, where steering means adding a vector $s_t$ to the hidden activations at that token. It also shows that linear probes can predict the future outcome distribution $o_t$ from the hidden state $h_t$, and that the model's own activations give a stronger signal than another model's embeddings of the same text. Together, the findings point toward cheaper uncertainty estimation and more targeted control of reasoning LLMs, with implications for safety and interpretability.
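As a rough illustration of the intervention described above, the sketch below adds a steering vector to a single hidden state. The function name `steer`, the scale factor `alpha`, and the toy three-dimensional vectors are assumptions made for this example; the summary only states that a vector $s_t$ is added to the hidden activations.

```python
from typing import List


def steer(hidden: List[float], steering: List[float], alpha: float = 1.0) -> List[float]:
    """Return hidden + alpha * steering, element-wise.

    `alpha` is a hypothetical strength parameter, not taken from the paper.
    """
    if len(hidden) != len(steering):
        raise ValueError("hidden and steering must have the same dimension")
    return [h + alpha * s for h, s in zip(hidden, steering)]


# Toy values standing in for a real hidden state h_t and steering vector s_t.
h_t = [0.5, -1.0, 2.0]
s_t = [1.0, 0.0, -1.0]
steered = steer(h_t, s_t, alpha=0.5)
```

In a real model the same addition would be applied to one layer's activations at token position $t$ during the forward pass, and the rest of the generation would continue from the modified state.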

Abstract

When a language model generates text, the selection of individual tokens might lead it down very different reasoning paths, making uncertainty difficult to quantify. In this work, we consider whether reasoning language models represent the alternate paths that they could take during generation. To test this hypothesis, we use hidden activations to control and predict a language model's uncertainty during chain-of-thought reasoning. In our experiments, we find a clear correlation between how uncertain a model is at different tokens, and how easily the model can be steered by controlling its activations. This suggests that activation interventions are most effective when there are alternate paths available to the model -- in other words, when it has not yet committed to a particular final answer. We also find that hidden activations can predict a model's future outcome distribution, demonstrating that models implicitly represent the space of possible paths.
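One way to make "how uncertain a model is at a token" concrete is to re-sample continuations from that token, record the final answer of each, and summarize the resulting empirical distribution. The sketch below does this for hand-written answer lists; the helper names and the use of Shannon entropy as the summary statistic are assumptions of this example, not details confirmed by the abstract.

```python
import math
from collections import Counter
from typing import Dict, List


def outcome_distribution(final_answers: List[str]) -> Dict[str, float]:
    """Empirical distribution over final answers from re-sampled continuations."""
    if not final_answers:
        raise ValueError("need at least one sampled continuation")
    counts = Counter(final_answers)
    total = len(final_answers)
    return {answer: count / total for answer, count in counts.items()}


def entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy in nats; zero when every sample agrees."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


# Hypothetical final answers of 8 continuations sampled from two token positions.
early = outcome_distribution(["A", "B", "A", "C", "B", "A", "B", "C"])
late = outcome_distribution(["A"] * 8)
```

Here `early` spreads its mass over three answers and has high entropy, while `late` has collapsed onto one answer, which corresponds to the model having committed to a final answer.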

Paper Structure

This paper contains 5 sections, 1 equation, 6 figures.

Figures (6)

  • Figure 1: Our experimental set-up. By intervening on the generated tokens, we create branching paths to estimate the model's outcome distribution. By intervening on the model's activations, we steer the base generation towards a desired outcome.
  • Figure 2: Comparison of the model outcome distribution $o_t$ (top) and steering success (bottom) across tokens. The outcome distribution and steering success have similar dynamics, with the same change points detected by the CPD algorithm (highlighted text).
  • Figure 3: Correlation between steering success ($y$-axis) and base outcome probability ($x$-axis) across token positions.
  • Figure 4: Our experimental set-up for Section \ref{sec:probe}. At every token position $t$, we train a linear probe to predict the distribution of outcomes $o_t$ from re-sampled paths starting at $t$, given the hidden representation $h_t$ over that token.
  • Figure 5: KL loss (lower is better) for linear probes predicting the outcome distribution of Llama from the hidden representations of Llama (blue) and Gemma (green) at the same token mid-generation. Low loss suggests that hidden states over chain-of-thought text are predictive of Llama's outcome distribution.
  • ...and 1 more figure
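The probing set-up in the captions above (a linear map from $h_t$ to a predicted outcome distribution, scored with KL loss) can be sketched in plain Python. Everything concrete here is an assumption for illustration: the toy two-dimensional hidden states, the target distributions, the per-example gradient descent, and the learning rate are not taken from the paper.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def kl(o: Vector, p: Vector, eps: float = 1e-12) -> float:
    """KL(o || p): target outcome distribution o, probe prediction p."""
    return sum(oi * math.log((oi + eps) / (pi + eps)) for oi, pi in zip(o, p) if oi > 0)


def predict(W: List[Vector], b: Vector, h: Vector) -> Vector:
    """Linear probe: softmax(W h + b)."""
    return softmax([sum(w * x for w, x in zip(row, h)) + bj for row, bj in zip(W, b)])


def train_probe(H: List[Vector], O: List[Vector], lr: float = 0.5,
                epochs: int = 2000) -> Tuple[List[Vector], Vector]:
    d, k = len(H[0]), len(O[0])
    W = [[0.0] * d for _ in range(k)]
    b = [0.0] * k
    for _ in range(epochs):
        for h, o in zip(H, O):
            p = predict(W, b, h)
            for j in range(k):
                g = p[j] - o[j]  # gradient of KL(o || p) w.r.t. logit j
                b[j] -= lr * g
                for i in range(d):
                    W[j][i] -= lr * g * h[i]
    return W, b


def mean_kl(W: List[Vector], b: Vector, H: List[Vector], O: List[Vector]) -> float:
    return sum(kl(o, predict(W, b, h)) for h, o in zip(H, O)) / len(H)


# Hypothetical hidden states h_t and outcome distributions o_t (two outcomes).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
O = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]

loss_before = mean_kl([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], H, O)
W, b = train_probe(H, O)
loss_after = mean_kl(W, b, H, O)
```

A probe that reaches low KL on held-out tokens would suggest, in the spirit of Figure 5, that the hidden state carries information about the outcome distribution; this toy version only checks that the loss drops on its own training data.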