Hybrid Latent Reasoning via Reinforcement Learning

Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, Dong Wang

TL;DR

Latent reasoning offers a promising alternative to chain-of-thought but is hard to integrate with pretrained LLMs. The authors propose HRPO, an RL-based framework that blends discrete token sampling with continuous hidden states through a learnable gating mechanism, enabling hybrid latent reasoning without CoT traces while preserving the model's generative capabilities. HRPO demonstrates consistent gains on knowledge and STEM benchmarks, often matching or exceeding larger models, and reveals interpretable patterns such as cross-lingual reasoning and shorter completions. These results suggest a viable route to scalable, interpretable latent reasoning in LLMs and open avenues for further RL-based latent space learning.

Abstract

Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features rather than sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs' generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, HRPO's hybrid design introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods in both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors like cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offering insights for future work in latent reasoning.
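
The gating idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the gate form exp(-c * softplus(Λ)) is taken from the Figure 4 caption, while the square-root weight on the hidden feature, the function name `hybrid_input`, and the default `c = 1.0` are assumptions made purely for illustration.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def hybrid_input(token_emb, hidden, gate_param, c=1.0):
    """Blend a sampled-token embedding with a prior hidden state, elementwise.

    Assumed form (hypothetical, for illustration only):
        a   = exp(-c * softplus(gate_param))   # weight on the token embedding, in (0, 1)
        out = a * token_emb + sqrt(1 - a^2) * hidden
    """
    out = []
    for e, h, lam in zip(token_emb, hidden, gate_param):
        a = math.exp(-c * softplus(lam))
        out.append(a * e + math.sqrt(1.0 - a * a) * h)
    return out


token_emb = [1.0, -2.0]
hidden = [0.5, 0.5]

# Very negative gate parameters: the input is dominated by the token embedding,
# mirroring "initializes training with predominantly token embeddings".
print(hybrid_input(token_emb, hidden, [-20.0, -20.0]))

# Larger gate parameters: the hidden feature contributes more.
print(hybrid_input(token_emb, hidden, [5.0, 5.0]))
```

Under these assumptions, a single learnable parameter per dimension moves the input smoothly between the discrete token embedding and the continuous hidden feature, while token sampling itself stays unchanged.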

Paper Structure

This paper contains 15 sections, 6 equations, 26 figures, 9 tables.

Figures (26)

  • Figure 1: Comparison between discrete reasoning (left) and latent reasoning (right). Unlike the autoregressive sampling process in discrete reasoning, latent reasoning incorporates hidden representations from previous steps to enhance reasoning performance (between <think> and </think>).
  • Figure 2: Hybrid reasoning with gating (left) and hybrid reasoning policy optimization (right). During rollouts, the reasoning trajectory is generated hybridly with both discrete tokens and latent features, and for policy update, we compute the HRPO loss using the hybrid rollout buffer to update the model.
  • Figure 3: Reward on MATH for Qwen-2.5-1.5B using different latent reasoning strategies.
  • Figure 4: Hidden ratio with varying $r_{\mathrm{min}}$ in $\texttt{exp}(-c \cdot \texttt{softplus}(\Lambda))$ and learning rate. We visualize the hidden ratio and completion length for training runs with $r_{\mathrm{min}}$ from $[0.95, 0.98, 0.99]$.
  • Figure 5: Sensitivity analysis for temperature $\tau$ in Eq. \ref{eq:hidden_states}. We visualize the reward and completion length for training runs with different temperatures selected from $[0.3, 0.5, 0.7, 0.9]$.
  • ...and 21 more figures
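
As a worked example for the expression $\texttt{exp}(-c \cdot \texttt{softplus}(\Lambda))$ in the Figure 4 caption, the sketch below inverts it to find the $\Lambda$ that yields a given ratio. Treating $r_{\mathrm{min}}$ as the target value of that expression is an assumption, as are the function name `lambda_for_ratio` and the default `c = 1.0`; the paper's actual initialization may differ.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def lambda_for_ratio(r, c=1.0):
    """Solve exp(-c * softplus(lam)) = r for lam, with 0 < r < 1."""
    target = -math.log(r) / c            # required softplus value, always > 0
    return math.log(math.expm1(target))  # inverse of softplus


# Values of r_min listed in the Figure 4 caption.
for r_min in (0.95, 0.98, 0.99):
    lam = lambda_for_ratio(r_min)
    recovered = math.exp(-softplus(lam))
    print(f"r_min={r_min}: Lambda={lam:.4f}, exp(-softplus(Lambda))={recovered:.4f}")
```

For `c = 1` the expression reduces to $1 / (1 + e^{\Lambda})$, so a ratio close to 1 corresponds to a strongly negative $\Lambda$.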