TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents

Aladin Djuhera; Swanand Ravindra Kadhe; Farhan Ahmed; Holger Boche

TL;DR

TSR moves search techniques from inference time to training-time rollout generation for multi-turn RL with LLM agents. By performing lightweight tree search over per-turn action prefixes with a task-specific scoring function, TSR constructs higher-quality trajectories without changing the underlying optimization objective, which makes it optimizer-agnostic. Instantiations with best-of-$N$, beam search, and shallow lookahead, combined with instance filtering, yield gains of up to 15% across Sokoban, FrozenLake, and WebShop while keeping training stable. The approach improves both stability and inference efficiency, and enables smaller models to outperform larger generalist models on diverse tasks, highlighting the practical value of spending training-time compute on rollout quality.

Abstract

Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks. However, multi-turn RL remains challenging as rewards are often sparse or delayed, and environments can be stochastic. In this regime, naive trajectory sampling can hinder exploitation and induce mode collapse. We propose TSR (Trajectory-Search Rollouts), a training-time approach that repurposes test-time scaling ideas for improved per-turn rollout generation. TSR performs lightweight tree-style search to construct high-quality trajectories by selecting high-scoring actions at each turn using task-specific feedback. This improves rollout quality and stabilizes learning while leaving the underlying optimization objective unchanged, making TSR optimizer-agnostic. We instantiate TSR with best-of-N, beam, and shallow lookahead search, and pair it with PPO and GRPO, achieving up to 15% performance gains and more stable learning on Sokoban, FrozenLake, and WebShop tasks at a one-time increase in training compute. By moving search from inference time to the rollout stage of training, TSR provides a simple and general mechanism for stronger multi-turn agent learning, complementary to existing frameworks and rejection-sampling-style selection methods.
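To make the per-turn selection idea concrete, the following is a minimal sketch of a beam-search rollout on a toy problem. It is not the authors' implementation: the 1-D corridor environment, the `step`, `score`, and `sample_actions` functions, and all parameter names (`beam_width`, `k`, `max_turns`) are hypothetical stand-ins chosen for illustration. In particular, `sample_actions` replaces sampling candidate actions from an LLM policy with uniform random choice, and `score` replaces the task-specific feedback with negative distance to a goal.

```python
import random

GOAL = 5                      # hypothetical goal position in a 1-D corridor
ACTIONS = ("left", "right")   # hypothetical action vocabulary


def step(pos, action):
    """Toy deterministic transition: move one cell, never below 0."""
    return pos + 1 if action == "right" else max(0, pos - 1)


def score(pos):
    """Assumed task-specific score: closer to the goal is better."""
    return -abs(GOAL - pos)


def sample_actions(rng, k):
    """Stand-in for drawing k candidate actions from the policy."""
    return [rng.choice(ACTIONS) for _ in range(k)]


def beam_search_rollout(rng, max_turns=8, beam_width=2, k=3,
                        propose=sample_actions):
    """Build one trajectory by keeping the top-scoring prefixes each turn."""
    beams = [([], 0)]  # each beam is (action prefix, current position)
    for _ in range(max_turns):
        candidates = []
        for prefix, pos in beams:
            if pos == GOAL:
                # Finished prefixes are carried over unchanged.
                candidates.append((prefix, pos))
                continue
            for action in propose(rng, k):
                candidates.append((prefix + [action], step(pos, action)))
        # Keep the beam_width highest-scoring prefixes (stable sort).
        candidates.sort(key=lambda c: score(c[1]), reverse=True)
        beams = candidates[:beam_width]
        if all(pos == GOAL for _, pos in beams):
            break
    return beams[0]  # best (actions, final position) found


if __name__ == "__main__":
    actions, final_pos = beam_search_rollout(random.Random(0))
    print(actions, final_pos)
```

In a training loop, the returned trajectory would take the place of a naively sampled one in the rollout batch, and the optimizer (for example PPO or GRPO) would consume it as usual. The sketch only covers trajectory construction and leaves the policy update out.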

Paper Structure (57 sections, 15 equations, 7 figures, 10 tables, 1 algorithm)


Introduction
Background and Problem Setup
Multi-Turn RL as a Partially Observable Markov Decision Process (POMDP)
Policy Optimization for Multi-Turn RL
Why Rollout Quality and Diversity Drive Stability
Optimizing Rollouts with Trajectory Search
Trajectory Rollouts via Tree Search
Candidate Action Set.
Scoring Function.
Trajectory Search Rollouts (TSR).
Search Strategies
Trajectory-Level Best-of-$N$.
Per-Turn Beam Search.
Shallow Lookahead Search.
Instance Filtering for Enhanced Task Diversity
...and 42 more sections

Figures (7)

Figure 1: The "Corner Trap": An Irreversible Mistake. A naive rollout sees Push Right as progress, but it traps the box between the wall and the pillar. Best-of-$N=4$ rollout sampling explores multiple possibilities and selects one that avoids the dead-end.
Figure 2: (Left) Multi-turn RL with naive rollouts: trajectories are sampled independently without any search. (Right) Trajectory Search Rollouts (TSR): lightweight tree-style search to construct high-quality trajectories by selecting high-scoring actions at each turn.
Figure 3: Success Rate on Held-Out Validation Sets. Comparison of TSR variants (Best-of-$N$, Lookahead, Beam Search) against the Instance Filtering baseline across all environments. Shaded regions show standard deviation across 3 runs.
Figure 4: Exploitation, Exploration, and Stability Metrics for Sokoban (Qwen2.5-3B). (a) TSR methods achieve higher average training rewards than Instance Filtering, indicating improved exploitation from higher-quality rollouts. (b) Rollout entropy decreases smoothly over training, suggesting sustained exploration early on, followed by gradual policy consolidation. (c) Gradient norms remain stable and free of large spikes across TSR methods, consistent with stable optimization dynamics and the absence of Echo Trap collapse.
Figure 5: Illustration of TSR tree-search–based rollout generation with best-of-$N$, beam search, and shallow lookahead search methods.
...and 2 more figures
