Latent Thinking Optimization: Your Latent Reasoning Language Model Secretly Encodes Reward Signals in Its Latent Thoughts

Hanwen Du, Yuxin Dong, Xia Ning

TL;DR

This work shows that Huginn-3.5B's latent thinking trajectories encode signals predictive of answer correctness, enabling supervision directly in latent space. It introduces Latent Thinking Optimization (LTO), which uses a Latent Reward Model (LRM) to guide the sampling of latent trajectories via a KL-regularized objective, with a closed-form reweighting and a probabilistic sampler. Across diverse math, coding, and commonsense tasks, LTO consistently improves correctness, transfers across domains, and applies to general LLMs, all with high efficiency. The results suggest latent-space reward modeling as a general, domain-agnostic approach to boosting reasoning in large language models, offering a lower-cost alternative to verbal reasoning.

Abstract

Large Language Models (LLMs) excel at problem solving by generating chains of thought in natural language, but such verbal thinking is computationally costly and prone to overthinking. Recent work instead proposes a latent thinking architecture, Huginn-3.5B, which represents intermediate reasoning steps as a sequence of latent representations. However, latent thoughts lack interpretability and are difficult to supervise, raising concerns about the correctness and reliability of the model's latent thinking processes. In this paper, we provide a systematic study of how Huginn-3.5B thinks in the latent space and how external supervision signals can improve its latent thinking processes. We show that latent thoughts leading to correct versus incorrect answers exhibit highly distinguishable patterns, and that a latent classifier can reliably predict answer correctness directly from latent thoughts. Leveraging these insights, we propose Latent Thinking Optimization (LTO), a probabilistic algorithm that employs the latent classifier as a Latent Reward Model (LRM) to optimize the latent thinking processes. Extensive experiments across diverse reasoning tasks demonstrate that the LRM is highly effective in detecting incorrect latent thinking patterns, and that LTO can significantly improve the latent thinking processes. Furthermore, we show that the LRM can generalize across diverse domains, and that LTO can be seamlessly applied to general LLMs to improve their thinking processes. In contrast to verbal thinking, our method demonstrates that reward modeling and scaling test-time thinking with supervision can be performed directly in the latent space, highlighting its potential as a general, efficient, and domain-agnostic approach to improving the thinking processes of LLMs.
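
For orientation, the KL-regularized objective named in the TL;DR can be written in its standard textbook form. This is a sketch under assumptions, not the paper's exact equation, which is not reproduced on this page: $\pi$ denotes a distribution over latent trajectories $z$ given an input $x$, $\pi_\text{ref}$ the base model's distribution, $r$ the reward from the LRM, and $\beta > 0$ the regularization strength.

$$
\max_{\pi}\; \mathbb{E}_{z\sim\pi(\cdot\mid x)}\big[r(x,z)\big] \;-\; \beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_\text{ref}(\cdot\mid x)\big)
$$

The maximizer of this objective has the well-known closed form

$$
\pi^{*}(z\mid x) \;=\; \frac{1}{Z(x)}\,\pi_\text{ref}(z\mid x)\exp\left(\frac{1}{\beta}\,r(x,z)\right),
\qquad
Z(x) \;=\; \sum_{z}\pi_\text{ref}(z\mid x)\exp\left(\frac{1}{\beta}\,r(x,z)\right),
$$

with the sum replaced by an integral when $z$ is continuous. A small $\beta$ concentrates mass on high-reward trajectories, while a large $\beta$ keeps $\pi^{*}$ close to $\pi_\text{ref}$.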

Paper Structure

This paper contains 40 sections, 5 theorems, 21 equations, 8 figures, 7 tables, 2 algorithms.

Key Result

Theorem 1

Given a sampled set $\{z_i\}^{N}_{i=1}$ used to approximate the policy distribution $\pi^{*}(z\mid x)$, for each $i$, the solution to the KL-constrained reward optimization problem is $\pi_{r}(z_i\mid x)=\frac{\pi_\text{ref}(z_i\mid x)\exp\left(\frac{1}{\beta}r(x, z_i)\right)}{\sum^{N}_{j=1}\pi_\text{ref}(z_j\mid x)\exp\left(\frac{1}{\beta}r(x, z_j)\right)}$.
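
The reweighting in Theorem 1 can be illustrated with a short numerical sketch. It assumes that each of the $N$ sampled trajectories comes with a log-probability under $\pi_\text{ref}$ and a scalar reward from the LRM, and that the weights are normalized over the sampled set. The function names `lto_weights` and `resample` and the toy numbers are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def lto_weights(log_pi_ref, rewards, beta):
    """Self-normalized weights w_i proportional to pi_ref(z_i|x) * exp(r(x, z_i) / beta)."""
    log_pi_ref = np.asarray(log_pi_ref, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    logits = log_pi_ref + rewards / beta
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def resample(weights, rng, size=1):
    """Draw trajectory indices in proportion to the reweighted distribution."""
    return rng.choice(len(weights), size=size, p=weights)

# Toy example (assumed values): four sampled trajectories, uniform under pi_ref.
rng = np.random.default_rng(0)
rewards = np.array([0.1, 0.9, 0.5, 0.2])   # hypothetical LRM scores
log_pi_ref = np.log(np.full(4, 0.25))
weights = lto_weights(log_pi_ref, rewards, beta=0.1)
chosen = resample(weights, rng, size=1)
```

With a small `beta` the weights concentrate on the highest-reward trajectory, and as `beta` grows they fall back toward the reference distribution.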

Figures (8)

  • Figure 1: Visualization of the distribution of the correct and incorrect latent thoughts projected onto 3D space using PCA for dimension reduction. The arrows along the lines indicate the progression from the current step to the next step of the latent thought. More examples are in the appendix.
  • Figure 2: Representation quality metrics of the latent thoughts on two datasets. The blue and red distributions represent the distributions for the correct and incorrect trajectory of latent thoughts, respectively. These metrics are calculated using all the samples from each dataset.
  • Figure 3: Performance of the latent classifier trained with varying numbers of latent thinking steps on the SVAMP and MBPP datasets. Additional metrics and results are available in the appendix.
  • Figure 4: Performance of LTO using different LRMs. "GSM-S" refers to the GSM-Symbolic dataset. "CQA" refers to the CommonsenseQA dataset. "None" refers to the performance of the base model without LTO.
  • Figure A1: Visualizations of the distributions of correct and incorrect latent thoughts projected onto 3D space, showing that correct and incorrect latent thoughts exhibit different patterns in the latent space. This phenomenon is not limited to the cases shown. On the SVAMP dataset, we identify 1,654 problems with both correct and incorrect answers, and on the MBPP dataset, we identify 179 such problems. In all of these cases, the latent thoughts leading to correct versus incorrect answers show different patterns in the latent space.
  • ...and 3 more figures
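
The 3D PCA projection described in the captions of Figure 1 and Figure A1 can be sketched as follows. This is a minimal illustration, assuming a latent trajectory is stored as an array with one row per latent thinking step. The array sizes, the random data, and the function name `pca_project` are hypothetical and do not reflect Huginn-3.5B's actual dimensions or the authors' plotting code.

```python
import numpy as np

def pca_project(latents, n_components=3):
    """Project rows of `latents` onto their top principal components."""
    X = np.asarray(latents, dtype=float)
    X_centered = X - X.mean(axis=0, keepdims=True)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Hypothetical trajectory: 32 latent steps with a hidden size of 64.
rng = np.random.default_rng(0)
trajectory = rng.normal(size=(32, 64))

coords = pca_project(trajectory)      # one 3D point per latent step
arrows = np.diff(coords, axis=0)      # displacement from each step to the next
```

Each row of `arrows` corresponds to one arrow in the figures, pointing from the current latent step to the next.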

Theorems & Definitions (9)

  • Theorem 1
  • Theorem 2
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Definition 1: Perfect reward model
  • Theorem 3
  • proof