Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning
Hanjiang Hu, Changliu Liu, Na Li, Yebin Wang
TL;DR
The paper tackles the difficulty of training LLM-based agents for long-horizon, multi-turn task planning by recasting the problem as a sequence of single-turn task reasoning problems. It develops a theoretical and empirical framework around Group Relative Policy Optimization (GRPO) applied to a single-turn MDP built from expert trajectories with unique minimal optimality, and proves that GRPO improvement in this single-turn setting yields a lower bound on the multi-turn success probability. Empirically, a 1.5B-parameter model trained with single-turn GRPO matches or outperforms larger baselines of up to 14B parameters on the Robotouille benchmark, achieving high success rates on long-horizon tasks and demonstrating cross-task generalization and robustness to noisy demonstrations. The work points to a practical route to efficient, scalable LLM agents for complex planning: dense reward signals derived from expert trajectories, combined with principled single-turn RL post-training.
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in knowledge acquisition, reasoning, and tool use, making them promising candidates for autonomous agent applications. However, training LLM agents for complex multi-turn task planning faces significant challenges, including sparse episode-wise rewards, credit assignment across long horizons, and the computational overhead of reinforcement learning in multi-turn interaction settings. To this end, this paper introduces a novel approach that transforms multi-turn task planning into single-turn task reasoning problems, enabling efficient policy optimization through Group Relative Policy Optimization (GRPO) with dense and verifiable rewards from expert trajectories. Our theoretical analysis shows that GRPO improvement on single-turn task reasoning yields a lower bound on the multi-turn success probability under the minimal number of turns, and that it generalizes to subtasks with shorter horizons. Experimental evaluation on a complex task planning benchmark demonstrates that our 1.5B-parameter model trained with single-turn GRPO achieves superior performance compared to larger baseline models of up to 14B parameters, with success rates of 70% for long-horizon planning tasks.
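To make the single-turn formulation more concrete, the following is a minimal Python sketch, not the authors' implementation. It splits an expert trajectory into single-turn samples, scores a group of sampled actions against the expert action, and computes group-relative advantages by normalizing each reward with the group's mean and standard deviation, as GRPO does. The function names, the exact-match binary reward, and the toy kitchen states and actions are all illustrative assumptions; the paper's actual reward and prompt format may differ.

```python
from statistics import mean, pstdev


def single_turn_samples(expert_trajectory):
    """Split one expert trajectory [(state, action), ...] into
    independent single-turn samples (hypothetical data layout)."""
    return [{"state": s, "expert_action": a} for s, a in expert_trajectory]


def verifiable_reward(proposed_action, expert_action):
    """Assumed dense reward: 1.0 if the proposed action matches the
    expert action at this step, otherwise 0.0."""
    return 1.0 if proposed_action == expert_action else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: center each reward on the group mean and
    scale by the group standard deviation (eps avoids division by zero)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy example: a two-step expert trajectory (made-up states and actions).
trajectory = [("raw patty on table", "pick up patty"),
              ("holding patty", "place patty on stove")]
samples = single_turn_samples(trajectory)

# A group of four actions sampled from the policy for the first state.
group = ["pick up patty", "move to stove", "pick up bun", "pick up patty"]
rewards = [verifiable_reward(a, samples[0]["expert_action"]) for a in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

In this sketch every step of the expert trajectory yields its own reward signal, which is what makes the reward dense compared with a single episode-level success flag. Actions matching the expert receive positive advantages and the others negative ones; if all sampled actions in a group receive the same reward, the advantages are all zero and that group contributes no gradient.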
