TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents
Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Holger Boche
TL;DR
TSR moves search from inference time to training-time rollout generation for multi-turn RL with LLM agents. It performs lightweight tree-style search over per-turn action prefixes, guided by a task-specific scoring function, to construct higher-quality trajectories. Because the underlying optimization objective is unchanged, TSR is optimizer-agnostic. Instantiations with best-of-$N$, beam search, and shallow lookahead, combined with instance filtering, yield performance gains of up to 15% across Sokoban, FrozenLake, and WebShop while keeping training stable. The approach also improves inference efficiency and enables smaller models to outperform larger generalist models on diverse tasks, highlighting the practical value of spending training-time compute on rollout quality.
Abstract
Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks. However, multi-turn RL remains challenging as rewards are often sparse or delayed, and environments can be stochastic. In this regime, naive trajectory sampling can hinder exploitation and induce mode collapse. We propose TSR (Trajectory-Search Rollouts), a training-time approach that repurposes test-time scaling ideas for improved per-turn rollout generation. TSR performs lightweight tree-style search to construct high-quality trajectories by selecting high-scoring actions at each turn using task-specific feedback. This improves rollout quality and stabilizes learning while leaving the underlying optimization objective unchanged, making TSR optimizer-agnostic. We instantiate TSR with best-of-N, beam, and shallow lookahead search, and pair it with PPO and GRPO, achieving up to 15% performance gains and more stable learning on Sokoban, FrozenLake, and WebShop tasks at a one-time increase in training compute. By moving search from inference time to the rollout stage of training, TSR provides a simple and general mechanism for stronger multi-turn agent learning, complementary to existing frameworks and rejection-sampling-style selection methods.
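To make the per-turn selection step concrete, the following is a minimal sketch of a best-of-N rollout, the simplest of the three search variants named above. It is an illustration under stated assumptions, not the authors' implementation: the function name `best_of_n_rollout`, the `policy`, `step`, and `score` callables, and the one-dimensional toy environment are all hypothetical and do not appear in the paper. The sketch only shows how sampling N candidate actions per turn and keeping the highest-scoring one yields a trajectory that could then be passed unchanged to an optimizer such as PPO or GRPO.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical interfaces, assumed for illustration only.
Policy = Callable[[int, random.Random], str]   # samples one action for a state
Step = Callable[[int, str], Tuple[int, bool]]  # (state, action) -> (next_state, done)
Score = Callable[[int, str], float]            # task-specific score of an action


def best_of_n_rollout(
    policy: Policy,
    step: Step,
    score: Score,
    init_state: int,
    n: int = 4,
    max_turns: int = 10,
    seed: int = 0,
) -> Tuple[List[Tuple[int, str]], int]:
    """Build one trajectory by keeping the best of n sampled actions per turn."""
    rng = random.Random(seed)
    state = init_state
    trajectory: List[Tuple[int, str]] = []
    for _ in range(max_turns):
        # Sample n candidate actions from the current policy.
        candidates = [policy(state, rng) for _ in range(n)]
        # Keep the candidate with the highest task-specific score.
        action = max(candidates, key=lambda a: score(state, a))
        trajectory.append((state, action))
        state, done = step(state, action)
        if done:
            break
    return trajectory, state


# Toy 1-D environment (hypothetical): walk from position 0 to GOAL.
GOAL = 3


def toy_step(pos: int, action: str) -> Tuple[int, bool]:
    nxt = pos + (1 if action == "right" else -1)
    return nxt, nxt == GOAL


def toy_score(pos: int, action: str) -> float:
    # Higher is better: negative distance to the goal after taking the action.
    nxt = pos + (1 if action == "right" else -1)
    return -abs(GOAL - nxt)


def toy_policy(pos: int, rng: random.Random) -> str:
    # Stand-in for an LLM policy: uniformly random over two actions.
    return rng.choice(["left", "right"])


if __name__ == "__main__":
    traj, final = best_of_n_rollout(toy_policy, toy_step, toy_score, 0, n=4)
    print(traj, final)
```

With `n=1` the sketch reduces to naive trajectory sampling; larger `n` spends more training-time compute per turn in exchange for higher-scoring actions. Beam search and shallow lookahead would replace the single `max` selection with a set of retained prefixes or a short forward simulation, respectively.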
