A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning
Ruiyi Wang, Prithviraj Ammanabrolu
TL;DR
The paper tackles the fragmentation in multi-turn agentic RL by decomposing the problem into environment, reward, and policy pillars and deriving a practical training recipe validated across TextWorld, ALFWorld, and SWE-Gym. It demonstrates that curriculum-style environment design, strong policy priors from supervised fine-tuning, and dense, execution-based reward signals are crucial for stable, efficient learning of long-horizon agentic behaviors. Key findings include robust cross-environment generalization from simpler to complex tasks, the superiority of biased yet stable optimization (PPO/GRPO) over unbiased estimators in many settings, and the necessity of fine-grained verifier rewards over model-based proxies. The work offers actionable guidance for researchers and practitioners aiming to build autonomous, multi-turn agents capable of interacting with real-world textual and programming environments, and it provides reproducible resources to accelerate future work.
Abstract
We study what actually works and what doesn't for training large language models as agents via multi-turn reinforcement learning. Despite rapid progress, existing frameworks and definitions are fragmented, and there is no systematic formulation or analysis of which design choices matter across tasks. We address this gap by first breaking down the design space into three inter-related pillars -- environment, reward, and policy -- and then empirically deriving a recipe for training LLM agents in situated textual domains. In particular, we test TextWorld and ALFWorld, popular domains for testing situated embodied reasoning, as well as SWE-Gym for more software-engineering-style tasks. (i) For the environment, we analyze the impact of task complexity in terms of the sizes of the state and action spaces as well as optimal solution length, finding that even simple environments within a domain can provide signal on how well an agent can generalize to more complex tasks. (ii) For the reward, we ablate relative reward sparsity, observing that while dense turn-level rewards accelerate training, performance and stability are highly dependent on the choice of RL algorithm. (iii) For the agent's policy, we explore the interplay between reward sparsity and biased (PPO, GRPO) and unbiased (RLOO) policy gradient methods, and we show how to find the optimal Supervised Fine-tuning (SFT) to RL training ratio given a fixed budget. We distill these findings into a training recipe that guides co-design across the three pillars, facilitating research and practical efforts in multi-turn agentic RL. Code: https://github.com/pearls-lab/meow-tea-taro
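To make the contrast between the RLOO and GRPO estimators named in (iii) concrete, the following is a minimal sketch of how each turns a group of per-trajectory rewards into advantages. It is an illustration under stated assumptions, not the paper's implementation: the function names `rloo_advantages` and `grpo_advantages`, the use of the population standard deviation, and the `eps` constant are all hypothetical choices made here. The only structural point it shows is that RLOO's baseline for a sample excludes that sample's own reward, while the GRPO-style advantage subtracts the full group mean and rescales by the group's standard deviation.

```python
from statistics import mean, pstdev


def rloo_advantages(rewards):
    """Leave-one-out advantages for one group of sampled trajectories.

    Each sample's baseline is the mean reward of the *other* samples,
    so the baseline never depends on the sample's own reward.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per group")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages in the style of GRPO.

    Assumption: population standard deviation and a small eps for
    numerical safety; real implementations may differ on both.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical sparse episode-level rewards for four rollouts of one task.
    group = [1.0, 0.0, 0.0, 1.0]
    print("RLOO:", rloo_advantages(group))
    print("GRPO:", grpo_advantages(group))
```

With sparse rewards, a group in which every rollout receives the same reward yields all-zero advantages under both functions, which is one way to see why reward density and estimator choice interact.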
