GTS: Inference-Time Scaling of Latent Reasoning with a Learnable Gaussian Thought Sampler

Minghan Wang, Ye Bai, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari

TL;DR

GTS predicts context-dependent perturbation distributions over continuous reasoning states and is trained with GRPO-style policy optimization while keeping the backbone frozen. On GSM8K with two latent reasoning architectures, it achieves more reliable inference-time scaling than heuristic baselines.

Abstract

Inference-time scaling (ITS) in latent reasoning models typically introduces stochasticity through heuristic perturbations, such as dropout or fixed Gaussian noise. While these methods increase trajectory diversity, their exploration behavior is not explicitly modeled and can be inefficient under finite sampling budgets. We observe that stronger perturbations do not necessarily translate into more effective candidate trajectories, as unguided noise may disrupt internal decision structure rather than steer it. To provide a more structured alternative, we model latent thought exploration as conditional sampling from learnable densities and instantiate this idea as a Gaussian Thought Sampler (GTS). GTS predicts context-dependent perturbation distributions over continuous reasoning states and is trained with GRPO-style policy optimization while keeping the backbone frozen. Experiments on GSM8K with two latent reasoning architectures show that GTS achieves more reliable inference-time scaling than heuristic baselines. These findings indicate that improving latent ITS requires structured and optimizable exploration mechanisms rather than simply amplifying stochasticity.
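
To make the sampling idea concrete, here is a minimal sketch of context-dependent Gaussian perturbation of a latent state, together with the group-normalized advantage commonly used in GRPO-style training. It is an illustration under stated assumptions, not the authors' implementation: the linear head, the additive form of the perturbation, the diagonal covariance, and every function and parameter name below are hypothetical.

```python
import math
import random

def predict_gaussian(h, w_mu, w_log_sigma):
    """Hypothetical linear head: map latent state h to a per-dimension mean and std."""
    mu = [sum(w * x for w, x in zip(row, h)) for row in w_mu]
    # Exponentiate the second projection so the standard deviation stays positive.
    sigma = [math.exp(sum(w * x for w, x in zip(row, h))) for row in w_log_sigma]
    return mu, sigma

def sample_thought(h, mu, sigma, rng):
    """Assumed additive perturbation: h' = h + delta, with delta ~ N(mu, diag(sigma^2))."""
    delta = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
    return [x + d for x, d in zip(h, delta)], delta

def gaussian_log_prob(delta, mu, sigma):
    """Log-density of delta under the diagonal Gaussian (needed by a policy-gradient loss)."""
    return sum(
        -0.5 * ((d - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for d, m, s in zip(delta, mu, sigma)
    )

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward by the mean and std of its group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Usage: draw a group of perturbed states for one (toy) latent state.
rng = random.Random(0)
h = [0.5, -1.0]
mu, sigma = predict_gaussian(h, [[0.1, 0.0], [0.0, 0.1]], [[0.0, 0.0], [0.0, 0.0]])
group = [sample_thought(h, mu, sigma, rng) for _ in range(4)]
log_probs = [gaussian_log_prob(delta, mu, sigma) for _, delta in group]
```

In this sketch only the weights of the sampler head would receive gradients, weighted by the advantages; the backbone that produces `h` and decodes the perturbed state is treated as frozen, matching the setup described in the abstract.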

Paper Structure (69 sections, 20 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Comparison of inference strategies in a conceptual likelihood space. Deterministic inference produces a single trajectory, resulting in limited exploration. Dropout-based sampling generates multiple trajectories with high diversity but substantial noise. In contrast, Gaussian Thought Sampling produces structured trajectories concentrated in high-likelihood regions around the ground truth.
  • Figure 2: ITS performance under different latent sampling strategies. Pass@N on COCONUT (left) and CODI (right) as a function of the number of sampled reasoning trajectories $N$. All methods coincide at $N=1$, corresponding to deterministic latent reasoning. As $N$ increases, GTS achieves stronger scaling behavior than dropout-based sampling and standard Gaussian noise (StandardG), indicating more effective exploration of the latent reasoning space.
  • Figure 3: Average number of unique decoded answers per prompt. Left: COCONUT. Right: CODI.
  • Figure 4: Step-wise distribution of signal-to-noise ratio (SNR) across latent reasoning steps. Top: COCONUT. Bottom: CODI. Each violin shows the distribution over prompts; markers denote medians.
  • Figure 5: Ablation study on reward shaping. Pass@N performance of GTS trained with the accuracy-only reward and with the dense reward introduced in \ref{sec:method:train_new}.
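
The Pass@N and unique-answer metrics named in the captions above can be sketched as follows. This is a hedged illustration only: the exact evaluation protocol is not given on this page, so the sketch assumes Pass@N counts a prompt as solved when any of its first N decoded answers exactly matches the reference, and the function names are hypothetical.

```python
def pass_at_n(samples, golds, n):
    """Fraction of prompts whose first n sampled answers contain the reference answer."""
    hits = sum(1 for answers, gold in zip(samples, golds) if gold in answers[:n])
    return hits / len(golds)

def mean_unique_answers(samples, n):
    """Average number of distinct decoded answers per prompt among the first n samples."""
    return sum(len(set(answers[:n])) for answers in samples) / len(samples)

# Usage with toy data: two prompts, three sampled answers each.
samples = [["3", "4", "5"], ["7", "7", "8"]]
golds = ["5", "9"]
scores = [pass_at_n(samples, golds, n) for n in (1, 2, 3)]
diversity = mean_unique_answers(samples, 3)
```

Reporting both quantities side by side separates diversity from usefulness: a sampler can raise the number of unique answers without raising Pass@N if the extra answers are wrong.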