Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards

Shangyu Xing, Siyuan Wang, Chenyuan Yang, Xinyu Dai, Xiang Ren

TL;DR

The paper tackles limited trajectory diversity in RLVR rollouts by introducing Lookahead Tree-Based Rollouts (LATR), a tree-based strategy that branches at high-uncertainty points, runs lookahead simulations, and prunes non-divergent paths to cultivate diverse reasoning trajectories. Implemented within GRPO and DAPO, LATR yields faster policy learning (about 131% acceleration on average) and improved final accuracy (pass@1 +4.2%) across multiple logical and mathematical benchmarks, while reducing inference costs and trajectory lengths. Key innovations include a probabilistic branching criterion, lookahead-based pruning using semantic divergence, and optional early-stopping and hybrid rollout schemes to balance diversity with test-time behavior. Overall, LATR demonstrates that trajectory-level diversity is a crucial driver for scaling RLVR in complex reasoning tasks, offering practical gains in efficiency and performance with broad applicability to existing RLVR pipelines.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR), particularly with algorithms like Group Relative Policy Optimization (GRPO), has proven highly effective in enhancing the reasoning capabilities of large language models. However, a critical bottleneck in current pipelines lies in the limited diversity of sampled trajectories during group rollouts. Homogeneous trajectories and their associated rewards diminish the return signals for policy updates, thereby hindering effective policy learning. This lack of diversity stems primarily from token-level stochastic sampling, where local variations are likely to collapse into near-identical reasoning paths. To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promote trajectory-level diversity by enforcing branching into different candidate tokens likely to yield distinct continuations. Specifically, LATR iteratively operates in three stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each new branch, and (3) pruning branches that exhibit prolonged similarity during simulation. Compared with stochastic sampling, LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and Dynamic sAmpling Policy Optimization (DAPO) algorithms across different reasoning tasks. Our code and data are publicly available at https://github.com/starreeze/latr.
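
The three stages described in the abstract (branch, look ahead, prune) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the probability threshold `tau_branch`, the positional-overlap similarity, the threshold `tau_sim`, and the hand-written `toy_policy` are all assumptions made for illustration. The actual branching criterion and divergence measure are defined in the paper and in the linked repository.

```python
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[str, ...]
Policy = Callable[[Prefix], Dict[str, float]]


def greedy_lookahead(policy: Policy, prefix: Prefix, steps: int) -> Prefix:
    """Extend `prefix` greedily for `steps` tokens; return only the new tokens."""
    seq = prefix
    for _ in range(steps):
        probs = policy(seq)
        seq = seq + (max(probs, key=probs.get),)
    return seq[len(prefix):]


def similarity(a: Prefix, b: Prefix) -> float:
    """Fraction of aligned positions holding the same token (a stand-in metric)."""
    n = min(len(a), len(b))
    if n == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / n


def latr_step(policy: Policy, branches: List[Prefix], tau_branch: float = 0.3,
              lookahead: int = 3, tau_sim: float = 0.67,
              max_branches: int = 4) -> List[Prefix]:
    """One hypothetical branch / lookahead / prune iteration over all branches."""
    next_branches: List[Prefix] = []
    for prefix in branches:
        ranked = sorted(policy(prefix).items(), key=lambda kv: -kv[1])
        # The most likely token always continues the existing branch.
        main = prefix + (ranked[0][0],)
        main_preview = greedy_lookahead(policy, main, lookahead)
        next_branches.append(main)
        # Stage 1 (assumed criterion): branch into alternatives that are
        # probable enough, i.e. the model is uncertain at this step.
        for token, p in ranked[1:]:
            if p < tau_branch or len(next_branches) >= max_branches:
                continue
            candidate = prefix + (token,)
            # Stage 2: simulate a short continuation of the new branch.
            preview = greedy_lookahead(policy, candidate, lookahead)
            # Stage 3: keep the branch only if its continuation diverges.
            if similarity(preview, main_preview) < tau_sim:
                next_branches.append(candidate)
    return next_branches


def toy_policy(prefix: Prefix) -> Dict[str, float]:
    """Hand-written next-token table keyed on the last token only."""
    table = {
        "<s>": {"add": 0.4, "sum": 0.3, "sub": 0.3},
        "add": {"x": 0.9, "y": 0.1},
        "sum": {"x": 0.9, "y": 0.1},  # collapses onto the same path as "add"
        "sub": {"y": 0.9, "x": 0.1},  # leads to a different continuation
        "x": {"y": 0.9, "x": 0.1},
        "y": {"x": 0.9, "y": 0.1},
    }
    return table[prefix[-1] if prefix else "<s>"]


step1 = latr_step(toy_policy, [()])
step2 = latr_step(toy_policy, step1)
print(step1)
print(step2)
```

In this toy setting the first step considers three candidate first tokens. The branch starting with `sum` is pruned because its lookahead matches that of `add`, while `sub` survives because its continuation differs. The second step adds no branches, since the toy policy is confident at those positions.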

Paper Structure

This paper contains 48 sections, 12 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: Comparison of conventional token-level stochastic sampling and our proposed method LATR on sampling process, rollout sequence diversity, and performance on DAPO Math dataset.
  • Figure 2: An overview of LATR. A dynamic search tree is built by branching on model uncertainty, simulating and pruning similar branches, resulting in diverse answers and reasoning paths.
  • Figure 3: Learning curve comparison on Countdown (left) and DAPO-Math (right) datasets.
  • Figure 4: Comparison of test correctness with different temperature $t$ (%).
  • Figure 5: Comparison of test correctness with different rollout number $k$ (%).
  • ...and 3 more figures