Prioritize the Process, Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models

Williams Jonathan; Tureci Esin

TL;DR

This work addresses the misalignment between reinforcement learning signals and the multi-step latent reasoning process of Looped Language Models (LoopLMs). The authors introduce RLTT, which distributes credit across the entire latent thought trajectory through a trajectory-weighted policy gradient with optional KL regularization. RLTT yields substantial improvements on challenging mathematical benchmarks and notable transfer to non-mathematical reasoning. It also produces more token-efficient reasoning, faster policy improvement, and richer gradient signals, at the cost of modest additional memory for storing per-loop logits. The results indicate that trajectory-level credit assignment is a robust, generalizable enhancement for LoopLMs, enabling more reliable and compact reasoning under strict inference budgets.

Abstract

Looped Language Models (LoopLMs) perform multi-step latent reasoning prior to token generation and outperform conventional LLMs on reasoning benchmarks at smaller parameter budgets. However, attempts to further improve LoopLM reasoning with reinforcement learning have failed: standard objectives such as Group Relative Policy Optimization (GRPO) assign credit only to the final latent state, creating a fundamental mismatch with the model's internal computation. To resolve this, we introduce RLTT (Reward Latent Thought Trajectories), a reinforcement learning framework that distributes reward across the full latent reasoning trajectory. RLTT provides dense, trajectory-level credit assignment without relying on external verifiers and can directly replace GRPO with negligible overhead. Across extensive experiments with Ouro-2.6B-Thinking under identical training and inference conditions, RLTT yields substantial improvements over GRPO on challenging mathematical reasoning benchmarks, improving accuracy by +14.4% on MATH-500, +16.6% on AIME24, and +10.0% on BeyondAIME. Despite being trained exclusively on mathematics, RLTT also transfers effectively to non-mathematical reasoning benchmarks, demonstrating the effectiveness of trajectory-level credit assignment for reinforcement learning in LoopLMs.
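To make the contrast between final-state and trajectory-level credit concrete, here is a minimal NumPy sketch. It is not the paper's objective: it assumes the trajectory-level signal can be written as a weighted sum of per-loop log-probabilities of the sampled tokens, and it omits importance ratios, clipping, and the optional KL term. The names `surrogate_loss` and `loop_weights` and the tensor layout `(G, T, L, V)` (group size, loops, tokens, vocabulary) are illustrative assumptions.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def token_log_probs(per_loop_logits, tokens):
    """Log-probability of each sampled token under every loop's distribution.

    per_loop_logits: (G, T, L, V); tokens: (G, L) integer ids. Returns (G, T, L).
    """
    G, T, L, V = per_loop_logits.shape
    shifted = per_loop_logits - per_loop_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    idx = np.broadcast_to(tokens[:, None, :, None], (G, T, L, 1))
    return np.take_along_axis(log_probs, idx, axis=-1)[..., 0]

def surrogate_loss(per_loop_logits, tokens, rewards, loop_weights):
    """Policy-gradient surrogate with credit spread over loops (assumed form).

    loop_weights: length-T non-negative weights, normalized here to sum to 1.
    A one-hot weight on the last loop recovers final-state-only credit.
    """
    logp = token_log_probs(per_loop_logits, tokens)        # (G, T, L)
    w = np.asarray(loop_weights, dtype=float)
    w = w / w.sum()
    traj_logp = np.einsum("t,gtl->gl", w, logp)            # (G, L)
    adv = group_relative_advantages(rewards)               # (G,)
    return float(-(adv[:, None] * traj_logp).mean())

# Hypothetical toy sizes: 4 responses, 3 loops, 5 tokens, vocabulary of 7.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3, 5, 7))
tokens = rng.integers(0, 7, size=(4, 5))
rewards = [1.0, 0.0, 0.0, 1.0]                             # binary correctness
final_only = surrogate_loss(logits, tokens, rewards, [0, 0, 1])
uniform = surrogate_loss(logits, tokens, rewards, [1, 1, 1])
```

With a one-hot weight on the final loop, only the last loop's distribution receives gradient, which is the bottleneck the abstract attributes to GRPO. Any weighting that puts mass on earlier loops lets the reward reach every loop's next-token distribution.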

Paper Structure (40 sections, 2 theorems, 131 equations, 6 figures, 10 tables, 1 algorithm)

Introduction
Related Work
RLTT: Reward Latent Thought Trajectories
Definitions
RLTT Policy Gradient
Method
Loop Weighting Strategies
RLTT Algorithm
Experiments
Training Dynamics
Reward Evolution
Response Length Evolution
Entropy Evolution
Math Benchmarks
Non-Math Benchmarks
...and 25 more sections

Key Result

Lemma 1.4

Under the refinement assumption, for every token position $j$,

Figures (6)

Figure 1: Comparison of GRPO vs. RLTT credit assignment. Left: GRPO only forms a direct relation between reward and the final loop's predicted next token distribution, creating a credit assignment bottleneck. Right: RLTT resolves this issue by rewarding the entire latent thought trajectory (i.e., forming a direct relation between reward and every loop's predicted next token distribution).
Figure 2: Reward evolution during MATH training. Mean reward (binary correctness) over training steps under RLTT and GRPO. RLTT achieves consistently higher reward than GRPO throughout training, with the performance gap emerging within the first 40 steps and widening steadily thereafter.
Figure 3: Response length evolution during MATH training. Mean generated response length over training steps. RLTT-trained policies converge to substantially shorter responses than GRPO, despite the reward function depending exclusively on final-answer correctness with no explicit brevity incentive.
Figure 4: Output token entropy evolution during MATH training. Mean entropy of the terminal-loop next-token distribution over training steps. RLTT exhibits a steeper and more sustained entropy reduction than GRPO, reflecting increased model confidence as correct reasoning trajectories stabilize.
Figure 5: Comparison of RLTT weight methods on MATH training dynamics.
...and 1 more figure
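
The quantity tracked in Figure 4, the mean entropy of the terminal-loop next-token distribution, can be sketched as follows. The source does not say whether entropy is measured in nats or bits or exactly how it is averaged. This sketch assumes nats and a plain mean over token positions. `mean_token_entropy` is an illustrative name.

```python
import numpy as np

def mean_token_entropy(final_loop_logits):
    """Mean Shannon entropy (nats) of next-token distributions.

    final_loop_logits: (L, V) array of terminal-loop logits, one row per
    generated token position.
    """
    logits = np.asarray(final_loop_logits, dtype=float)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    # Entropy per position, then averaged over positions.
    return float(-(probs * log_probs).sum(axis=-1).mean())
```

Under these assumptions, a uniform distribution over a vocabulary of size V gives an entropy of log V, and a sharply peaked distribution gives a value near zero, so a falling curve corresponds to more confident predictions.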

Theorems & Definitions (4)

Lemma 1.4: Per-token dominance
proof
Theorem 1.5: RLTT selects weakly smaller optimal length
proof
