Parallel Test-Time Scaling for Latent Reasoning Models

Runyang You, Yongqi Li, Meng Liu, Wenjie Wang, Liqiang Nie, Wenjie Li

TL;DR

This work extends parallel test-time scaling to latent reasoning models by introducing two uncertainty-driven sampling strategies—Monte Carlo Dropout and Additive Gaussian Noise—and a dedicated trajectory scorer, LatentRM, trained with a step-wise contrastive objective. The proposed framework enables scalable, parallel inference in continuous latent spaces and provides insights into the exploration dynamics of stochastic latent reasoning. Empirical results demonstrate that both sampling methods scale with compute and that LatentRM enables effective trajectory selection across budgets, with distinct diversity-coverage trade-offs. These findings establish a foundation for scalable latent inference and point to future directions like integrating sampling and aggregation into reinforcement learning for adaptive, compute-aware reasoning.

Abstract

Parallel test-time scaling (TTS) is a pivotal approach for enhancing large language models (LLMs), typically by sampling multiple token-based chains-of-thought in parallel and aggregating outcomes through voting or search. Recent advances in latent reasoning, where intermediate reasoning unfolds in continuous vector spaces, offer a more efficient alternative to explicit Chain-of-Thought. Whether such latent models can similarly benefit from parallel TTS remains open, mainly because continuous spaces lack sampling mechanisms and because there are no probabilistic signals for advanced trajectory aggregation. This work enables parallel TTS for latent reasoning models by addressing both issues. For sampling, we introduce two uncertainty-inspired stochastic strategies: Monte Carlo Dropout and Additive Gaussian Noise. For aggregation, we design a Latent Reward Model (LatentRM) trained with a step-wise contrastive objective to score and guide latent reasoning. Extensive experiments and visualization analyses show that both sampling strategies scale effectively with compute and exhibit distinct exploration dynamics, while LatentRM enables effective trajectory selection. Together, our explorations open a new direction for scalable inference in continuous spaces. Code is released at https://github.com/YRYangang/LatentTTS.
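To make the two sampling strategies concrete, here is a minimal sketch on a toy model. It is not the released implementation: `toy_step`, `rollout`, `sample_trajectories`, the weight matrix `W`, and the tanh update are all hypothetical stand-ins for a real latent reasoning model. The sketch only assumes what the abstract states: MC-dropout keeps random dropout masks active at inference, and AGN adds independent Gaussian noise to each latent thought.

```python
import math
import random

def toy_step(h, W, p=0.0, rng=None):
    """Hypothetical stand-in for one latent reasoning step: tanh(W @ dropout(h))."""
    if p > 0.0:
        # MC-dropout: sample a fresh dropout mask at inference time
        # (inverted scaling keeps the expected activation unchanged).
        h = [0.0 if rng.random() < p else x / (1.0 - p) for x in h]
    return [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W]

def rollout(h0, W, steps, p=0.0, sigma=0.0, rng=None):
    """Return the sequence of latent thoughts produced over `steps` steps."""
    if rng is None:
        rng = random.Random(0)
    h, trajectory = list(h0), []
    for _ in range(steps):
        h = toy_step(h, W, p=p, rng=rng)
        if sigma > 0.0:
            # AGN: independent Gaussian perturbation of each latent thought.
            h = [x + rng.gauss(0.0, sigma) for x in h]
        trajectory.append(h)
    return trajectory

def sample_trajectories(h0, W, steps, n, p=0.0, sigma=0.0, seed=0):
    """Draw n trajectories; with p == 0 and sigma == 0 they are all identical."""
    rng = random.Random(seed)
    return [rollout(h0, W, steps, p=p, sigma=sigma, rng=rng) for _ in range(n)]

# Toy usage with made-up weights and initial state.
W = [[0.5, -0.2], [0.1, 0.3]]
h0 = [1.0, -1.0]
deterministic = sample_trajectories(h0, W, steps=3, n=2)
with_agn = sample_trajectories(h0, W, steps=3, n=4, sigma=0.1)
with_dropout = sample_trajectories(h0, W, steps=3, n=16, p=0.5)
```

In this toy setting, `p` and `sigma` play the role of the dropout rate and noise scale that the paper sweeps: larger values spread the sampled trajectories further from the deterministic one.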

Paper Structure

This paper contains 47 sections, 31 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Sampling mechanisms for token-based generation and our proposed approaches for the latent setting. Token-based generation: multinomial sampling over token probabilities at each step. Monte Carlo Dropout (MC-dropout): stochastic inference via randomly sampled dropout masks. Additive Gaussian Noise (AGN): independent Gaussian perturbations injected into each latent thought.
  • Figure 2: Coverage (%) versus $N$ for COCONUT, CODI, and CoLaR on GSM-Test, GSM-Hard, and MultiArith. Each subplot compares MC-dropout and AGN using the optimal hyperparameter. Higher coverage indicates a larger fraction of problems solved within $N$ attempts. Results are reported as the mean over three runs.
  • Figure 3: Coverage versus diversity for MC-dropout (red) and AGN (blue) with $N \in \{4, 8, 16, 32\}$ by sweeping $p$ and $\sigma$ to span a range of diversity values. Darker shades indicate larger $N$. Results are shown for COCONUT (left) and CODI (right) on GSM-Test.
  • Figure 4: Diversity of latent trajectories across reasoning steps on GSM-Test with COCONUT. Left: MC-dropout ($p = 0.1$ to $0.5$). Right: AGN ($\sigma = 0.1$ to $0.5$).
  • Figure 5: t-SNE visualization of latent thoughts sampled with different dropout rates (red; $p$ from light to dark) and Gaussian-noise scales (blue; $\sigma$ from light to dark). The green marker denotes the deterministic latent thought (no stochasticity). Diamonds ($\diamond$) indicate correct reasoning trajectories; crosses ($\bm{\times}$) indicate incorrect ones. The two panels show an easy question and a hard question, respectively.
  • ...and 1 more figure
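
The coverage metric in the Figure 2 caption (the fraction of problems solved within $N$ attempts) can be sketched as below, alongside majority voting as one simple aggregation baseline. This is an illustrative sketch: the function names and the toy answers are made up, and it does not reproduce the paper's evaluation code or its LatentRM-based selection.

```python
from collections import Counter

def coverage(samples, gold):
    """Fraction of problems where at least one of the N sampled answers is correct."""
    hits = sum(any(a == g for a in answers) for answers, g in zip(samples, gold))
    return hits / len(gold)

def majority_vote_accuracy(samples, gold):
    """Fraction of problems where the most frequent sampled answer is correct."""
    correct = 0
    for answers, g in zip(samples, gold):
        top_answer, _ = Counter(answers).most_common(1)[0]
        correct += int(top_answer == g)
    return correct / len(gold)

# Hypothetical data: 3 problems, N = 3 sampled final answers each.
samples = [["18", "18", "20"], ["7", "9", "9"], ["3", "4", "5"]]
gold = ["18", "7", "6"]
print(coverage(samples, gold))
print(majority_vote_accuracy(samples, gold))
```

Coverage is an upper bound on what any selection rule can achieve from the same samples: in the second toy problem a correct answer is present but outvoted, which is the gap a trajectory scorer is meant to close.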