Prioritize the Process, Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models

Williams Jonathan, Tureci Esin

TL;DR

This work addresses the misalignment between reinforcement learning signals and the multi-step latent reasoning process in Looped Language Models (LoopLMs). The authors introduce RLTT, which distributes credit across the entire latent thought trajectory via a trajectory-weighted policy gradient with optional KL regularization. RLTT yields substantial improvements on challenging mathematical benchmarks and notable transfer to non-mathematical reasoning. It also produces more token-efficient reasoning, faster policy improvement, and richer gradient signals, at the cost of modest additional memory for storing per-loop logits. The results indicate that trajectory-level credit assignment is a robust, generalizable enhancement for LoopLMs that enables more reliable and compact reasoning under strict inference budgets.

Abstract

Looped Language Models (LoopLMs) perform multi-step latent reasoning prior to token generation and outperform conventional LLMs on reasoning benchmarks at smaller parameter budgets. However, attempts to further improve LoopLM reasoning with reinforcement learning have failed: standard objectives such as Group Relative Policy Optimization (GRPO) assign credit only to the final latent state, creating a fundamental mismatch with the model's internal computation. To resolve this, we introduce RLTT (Reward Latent Thought Trajectories), a reinforcement learning framework that distributes reward across the full latent reasoning trajectory. RLTT provides dense, trajectory-level credit assignment without relying on external verifiers and can directly replace GRPO with negligible overhead. Across extensive experiments with Ouro-2.6B-Thinking under identical training and inference conditions, RLTT yields substantial improvements over GRPO on challenging mathematical reasoning benchmarks, improving accuracy by +14.4% on MATH-500, +16.6% on AIME24, and +10.0% on BeyondAIME. Despite being trained exclusively on mathematics, RLTT also transfers effectively to non-mathematical reasoning benchmarks, demonstrating the effectiveness of trajectory-level credit assignment for reinforcement learning in LoopLMs.
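
The contrast drawn in the abstract, credit on the final latent state only versus credit spread over the whole latent trajectory, can be illustrated with a small sketch. This is not the paper's implementation: the function names, the "uniform" and "linear" weighting schemes, and the toy log-probabilities are assumptions made for illustration, and the clipping and KL terms used in practical policy-gradient objectives are omitted. Only the group-standardized advantage follows the usual GRPO recipe.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def loop_weights(num_loops, scheme="uniform"):
    """Hypothetical per-loop weights that sum to 1 (schemes are assumptions)."""
    if scheme == "uniform":
        raw = [1.0] * num_loops
    elif scheme == "linear":
        # Later loops receive proportionally more weight.
        raw = [float(t + 1) for t in range(num_loops)]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    total = sum(raw)
    return [x / total for x in raw]


def final_loop_surrogate(logps, advantage):
    """Credit only the last loop.

    logps[t][j] is the log-probability that loop t's next-token
    distribution assigns to the sampled token at position j.
    """
    return advantage * sum(logps[-1])


def trajectory_surrogate(logps, advantage, weights):
    """Credit every loop, weighted by `weights` (one weight per loop)."""
    return advantage * sum(w * sum(row) for w, row in zip(weights, logps))


# Toy example: 3 loops, 2 sampled tokens, one group of 4 rollouts.
rewards = [1.0, 0.0, 0.0, 1.0]            # binary correctness
advantages = group_relative_advantages(rewards)
logps = [[-2.0, -1.5], [-1.0, -0.8], [-0.5, -0.3]]

print(final_loop_surrogate(logps, advantages[0]))
print(trajectory_surrogate(logps, advantages[0], loop_weights(3, "uniform")))
```

Placing all weight on the last loop recovers the final-loop surrogate, so in this sketch the trajectory-weighted form strictly generalizes the final-loop one; gradients reach earlier loops only when their weights are nonzero.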

Paper Structure

This paper contains 40 sections, 2 theorems, 131 equations, 6 figures, 10 tables, 1 algorithm.

Key Result

Lemma 1.4

Under the refinement assumption (assump:refinement), for every token position $j$,

Figures (6)

  • Figure 1: Comparison of GRPO vs. RLTT credit assignment. Left: GRPO only forms a direct relation between reward and the final loop's predicted next token distribution, creating a credit assignment bottleneck. Right: RLTT resolves this issue by rewarding the entire latent thought trajectory (i.e., forming a direct relation between reward and every loop's predicted next token distribution).
  • Figure 2: Reward evolution during MATH training. Mean reward (binary correctness) over training steps under RLTT and GRPO. RLTT achieves consistently higher reward than GRPO throughout training, with the performance gap emerging within the first 40 steps and widening steadily thereafter.
  • Figure 3: Response length evolution during MATH training. Mean generated response length over training steps. RLTT-trained policies converge to substantially shorter responses than GRPO, despite the reward function depending exclusively on final-answer correctness with no explicit brevity incentive.
  • Figure 4: Output token entropy evolution during MATH training. Mean entropy of the terminal-loop next-token distribution over training steps. RLTT exhibits a steeper and more sustained entropy reduction than GRPO, reflecting increased model confidence as correct reasoning trajectories stabilize.
  • Figure 5: Comparison of RLTT weight methods on MATH training dynamics.
  • ...and 1 more figure
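
The quantity tracked in Figure 4, the mean entropy of the terminal-loop next-token distribution, can be computed as in the following minimal sketch. The helper names and the toy logits are assumptions for illustration; the source does not specify how the entropy was computed or averaged.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(logits):
    """Shannon entropy (in nats) of the distribution implied by `logits`."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def mean_token_entropy(per_token_logits):
    """Average entropy over the token positions of one response."""
    return sum(entropy(l) for l in per_token_logits) / len(per_token_logits)


# Toy terminal-loop logits for two token positions over a 4-token vocabulary.
uncertain = [0.0, 0.0, 0.0, 0.0]   # uniform distribution: maximal entropy
confident = [8.0, 0.0, 0.0, 0.0]   # sharply peaked: low entropy
print(mean_token_entropy([uncertain, confident]))
```

A falling curve for this quantity means the final loop's distribution is becoming more peaked, which is how the caption's "increased model confidence" would show up numerically.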

Theorems & Definitions (4)

  • Lemma 1.4: Per-token dominance
  • proof
  • Theorem 1.5: RLTT selects weakly smaller optimal length
  • proof