Are Latent Reasoning Models Easily Interpretable?

Connor Dilgren, Sarah Wiegreffe

Abstract

Latent reasoning models (LRMs) have attracted significant research interest due to their low inference cost (relative to explicit reasoning models) and theoretical ability to explore multiple reasoning paths in parallel. However, these benefits come at the cost of reduced interpretability: LRMs are difficult to monitor because they do not reason in natural language. This paper presents an investigation into LRM interpretability by examining two state-of-the-art LRMs. First, we find that latent reasoning tokens are often unnecessary for LRMs' predictions; on logical reasoning datasets, LRMs can almost always produce the same final answers without using latent reasoning at all. This underutilization of reasoning tokens may partially explain why LRMs do not consistently outperform explicit reasoning methods and raises doubts about the stated role of these tokens in prior work. Second, we demonstrate that when latent reasoning tokens are necessary for performance, we can decode gold reasoning traces 65-93% of the time for correctly predicted instances. This suggests LRMs often implement the expected solution rather than an uninterpretable reasoning process. Finally, we present a method to decode a verified natural language reasoning trace from latent tokens without knowing a gold reasoning trace a priori, demonstrating that it is possible to find a verified trace for a majority of correct predictions but only a minority of incorrect predictions. Our findings highlight that current LRMs largely encode interpretable processes, and interpretability itself can be a signal of prediction correctness.

Paper Structure

This paper contains 38 sections, 2 equations, 18 figures, 12 tables, 2 algorithms.

Figures (18)

  • Figure 1: An overview of our findings. Left: LRMs tend to commit to a final answer before exhausting their budget, indicating that they don't effectively use all available reasoning tokens. Middle: Vocabulary projections of latent tokens often encode gold reasoning traces, suggesting that the model follows an interpretable reasoning trace rather than an opaque one. Right: We can generate candidate steps encoded by a latent token and verify them by checking whether vocabulary projections change as expected under modified prompts (a sketch of this check follows the figure list).
  • Figure 2: Early stopping results. Solid bars indicate the first-match percentage, while hatched bars show the additional reasoning required for a stable match (black lines indicate one standard deviation), compared to the model's full reasoning trace (RT).
  • Figure 3: Relative performance of latent reasoning versus non-reasoning and explicit reasoning for the multi-reasoning models. Note: the x-axis scales differ to improve readability.
  • Figure 4: A gold reasoning trace found in Coconut + GPT-2 Small's vocabulary projections, from instance 220 of GSM8k-Aug's test split. The model answered this question correctly.
  • Figure 5: Backtracking results. "Any Gold RT" includes additional solutions from the MultiChain GSM8k-Aug dataset (Deng et al., 2025). Solid bars exclude question tokens as operands, and hatched bars show the increase from including them as candidate operands.
  • ...and 13 more figures