Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning

Hanjiang Hu, Changliu Liu, Na Li, Yebin Wang

TL;DR

The paper addresses the difficulty of training LLM-based agents for long-horizon, multi-turn task planning by recasting the problem as a sequence of single-turn task reasoning problems. It applies Group Relative Policy Optimization (GRPO) to a single-turn MDP built from expert trajectories with unique minimal optimality, and proves that policy improvement in this single-turn setting yields a lower bound on multi-turn success probability. Empirically, a 1.5B-parameter model trained with single-turn GRPO outperforms larger baselines of up to 14B parameters on the Robotouille benchmark, reaching a 70% success rate on long-horizon planning tasks, and shows cross-task generalization and robustness to noisy demonstrations. The work points to a practical route to efficient LLM planning agents: dense, verifiable rewards derived from expert trajectories, combined with single-turn RL post-training.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in knowledge acquisition, reasoning, and tool use, making them promising candidates for autonomous agent applications. However, training LLM agents for complex multi-turn task planning faces significant challenges, including sparse episode-wise rewards, credit assignment across long horizons, and the computational overhead of reinforcement learning in multi-turn interaction settings. To this end, this paper introduces a novel approach that transforms multi-turn task planning into single-turn task reasoning problems, enabling efficient policy optimization through Group Relative Policy Optimization (GRPO) with dense and verifiable rewards from expert trajectories. Our theoretical analysis shows that GRPO improvement on single-turn task reasoning yields a lower bound on the multi-turn success probability within the minimal number of turns, and that the resulting policy generalizes to subtasks with shorter horizons. Experimental evaluation on a complex task planning benchmark demonstrates that our 1.5B-parameter model trained with single-turn GRPO outperforms larger baseline models of up to 14B parameters, with success rates of 70% on long-horizon planning tasks.
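To make the "dense and verifiable reward" idea concrete, here is a minimal sketch of the group-relative advantage computation commonly used in GRPO, applied to a single state taken from an expert trajectory. It assumes a binary reward that checks whether a sampled action matches the expert action. The function names, the exact-match check, and the action strings are hypothetical illustrations, not the paper's implementation.

```python
from statistics import mean, pstdev


def expert_match_reward(action: str, expert_action: str) -> float:
    """Hypothetical binary reward: 1.0 if the sampled action equals the expert action."""
    return 1.0 if action.strip() == expert_action.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples drawn for the same state.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are scored relative to their siblings rather than by a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled actions for one expert state.
sampled = ["pick up lettuce", "move to stove", "pick up lettuce", "cut tomato"]
expert = "pick up lettuce"

rewards = [expert_match_reward(a, expert) for a in sampled]
advantages = group_relative_advantages(rewards)
```

Because every state on an expert trajectory has its own checkable target action, each single-turn sample receives an immediate reward, which avoids waiting for a sparse end-of-episode signal. When every sample in a group gets the same reward, the advantages are all zero and that group contributes no gradient.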

Paper Structure

This paper contains 21 sections, 6 theorems, 20 equations, 4 tables.

Key Result

Theorem 3.1

Let $\pi^*$ be the fixed point of the GRPO-optimized policy obtained from the policy iteration dynamics of eq:grpo_iter, and let $\pi^{\text{ref}}$ be the initial reference policy. Denote the success probability of the reference policy as $p^{\text{ref}} = \mathbb{E}_{a \sim \pi^{\text{ref}}(\cdot|s^{GT})} [

Theorems & Definitions (20)

  • Definition 2.1: Successful trajectory by stochastic policy
  • Definition 2.2: Expert trajectory by stochastic policy
  • Definition 2.3: Expert policy from expert trajectory
  • Remark 2.1
  • Definition 2.4: Reward function from expert trajectory
  • Theorem 3.1: Theorem 3 from mroueh2025reinforcement
  • proof
  • Corollary 3.1: GRPO single-turn optimality
  • proof
  • Definition 3.1: Minimal turns for task completion
  • ...and 10 more
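
The link between single-turn accuracy and multi-turn success under minimal turns can be illustrated with a short numerical sketch. It is not the paper's theorem statement. It assumes deterministic transitions, so that choosing the expert action at every step keeps the agent on the expert trajectory. Under that assumption, the probability of reproducing the whole expert trajectory is the product of the per-step probabilities of choosing the expert action, and any such rollout finishes the task in the minimal number of turns, so the product lower-bounds the multi-turn success probability. The per-step probabilities below are made-up values chosen only for illustration.

```python
import math


def expert_path_probability(step_probs):
    """Probability of choosing the expert action at every step of the trajectory.

    step_probs[t] is the (assumed) probability that the policy picks the expert
    action at the t-th expert state. Under deterministic transitions, this
    product lower-bounds the probability of success within the minimal turns.
    """
    if any(p < 0.0 or p > 1.0 for p in step_probs):
        raise ValueError("each per-step probability must lie in [0, 1]")
    return math.prod(step_probs)


horizon = 10  # hypothetical number of minimal turns

# Hypothetical per-step expert-match probabilities before and after training.
reference_bound = expert_path_probability([0.6] * horizon)
improved_bound = expert_path_probability([0.95] * horizon)

print(f"reference policy lower bound: {reference_bound:.4f}")
print(f"improved policy lower bound:  {improved_bound:.4f}")
```

The product shrinks geometrically with the horizon, which is why modest per-step gains from single-turn training can translate into large gains in long-horizon success.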