PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning

Keer Lu, Chong Chen, Xili Wang, Bin Cui, Yunhuai Liu, Wentao Zhang

TL;DR

The paper tackles the challenge of long-horizon planning and planner–executor coordination in LLM-based agents, addressing shortcomings of ReAct and supervised fine-tuning. It introduces AdaPlan, an adaptive global plan paradigm that unifies planning and execution, and PilotRL, a three-stage progressive reinforcement learning framework that first strengthens executor adherence, then improves global plan quality, and finally optimizes end-to-end coordination. Through extensive experiments on six agent benchmarks, PilotRL delivers state-of-the-art or competitive results and can surpass GPT-4o in several settings, highlighting the value of explicit global planning for generalization. The work demonstrates that a tightly integrated planner–executor model, guided by progressive RL and evaluated via frontier-model judges, enables robust, scalable improvements in open-source LLM agents with practical implications for real-world autonomy. The approach combines formal planning signals with RL to foster better generalization, efficiency, and coordination, offering a principled path toward adaptable, long-horizon AI agents. Formally, the plan $P^{(t)}$ is dynamically refined by a policy $\pi$ as the agent interacts with the environment.

Abstract

Large Language Models (LLMs) have shown remarkable advancements in tackling agent-oriented tasks. Despite their potential, existing work faces challenges when deploying LLMs in agent-based environments. The widely adopted agent paradigm ReAct centers on integrating single-step reasoning with immediate action execution, which limits its effectiveness in complex tasks requiring long-term strategic planning. Furthermore, the coordination between the planner and executor during problem-solving is also a critical factor to consider in agent design. Additionally, current approaches predominantly rely on supervised fine-tuning, which often leads models to memorize established task completion trajectories, thereby restricting their generalization ability when confronted with novel problem contexts. To address these challenges, we introduce AdaPlan, an adaptive global plan-based agent paradigm that aims to synergize high-level explicit guidance with execution to support effective long-horizon decision-making. Based on the proposed paradigm, we further put forward PilotRL, a global planning-guided training framework for LLM agents driven by progressive reinforcement learning. We first develop the model's ability to follow explicit guidance from global plans when addressing agent tasks. Subsequently, based on this foundation, we focus on optimizing the quality of generated plans. Finally, we conduct joint optimization of the model's planning and execution coordination. Experiments indicate that PilotRL achieves state-of-the-art performance, with LLaMA3.1-8B-Instruct + PilotRL surpassing the closed-source GPT-4o by 3.60%, while showing a more substantial gain of 55.78% compared to GPT-4o-mini at a comparable parameter scale.

Paper Structure

This paper contains 34 sections, 7 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Comparison of PilotRL (bottom) with existing methods (top) for agent task completion.
  • Figure 2: Overview of PilotRL. (Left) In the AdaPlan paradigm, the global planner begins by processing the task instruction and generates an initial high-level plan for guidance, which is then passed to the executor for action generation. The observation from the environment is then fed back both to the executor for subsequent action generation and to the global planner for plan adaptation in response to changes or unexpected outcomes. (Right) The three-stage training process of our progressive Reinforcement Learning (RL).
  • Figure 3: Normalized rewards for global planner, executor and end-to-end (E2E) performance in training LLaMA3.1-8B-Instruct.
  • Figure 4: An illustration for the Group Relative Policy Optimization (GRPO) pipeline.
  • Figure 5: Case study of ReAct (Yao et al., 2023) on BabyAI (Chevalier-Boisvert et al., 2018).
  • ...and 1 more figure
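
The control flow described in the Figure 2 caption (plan, act, observe, adapt the plan) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `run_adaplan`, the `Trace` container, and the toy planner, executor, and environment below are hypothetical stand-ins, and in the paper both roles are played by an LLM rather than hand-written rules.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces, assumed for illustration only.
Planner = Callable[[str, Optional[str], List[str]], str]  # (task, previous plan, observations) -> plan
Executor = Callable[[str, str, List[str]], str]           # (task, plan, observations) -> action
EnvStep = Callable[[str], Tuple[str, bool]]               # action -> (observation, done)


@dataclass
class Trace:
    plans: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)
    observations: List[str] = field(default_factory=list)


def run_adaplan(task: str, planner: Planner, executor: Executor,
                env_step: EnvStep, max_steps: int = 10) -> Trace:
    """Run one episode: initial global plan, then act/observe/adapt until done."""
    trace = Trace()
    plan = planner(task, None, trace.observations)  # initial high-level plan
    trace.plans.append(plan)
    for _ in range(max_steps):
        action = executor(task, plan, trace.observations)  # executor follows the plan
        observation, done = env_step(action)
        trace.actions.append(action)
        trace.observations.append(observation)
        if done:
            break
        # The observation is fed back to the planner so the plan can adapt.
        plan = planner(task, plan, trace.observations)
        trace.plans.append(plan)
    return trace


# Toy rule-based stand-ins for the LLM planner and executor.
def toy_planner(task: str, prev_plan: Optional[str], observations: List[str]) -> str:
    last = observations[-1] if observations else None
    if last == "door locked":
        return "find key, then open door"  # adapt to an unexpected outcome
    if last == "key found":
        return "open door"
    return prev_plan or "open door"


def toy_executor(task: str, plan: str, observations: List[str]) -> str:
    return plan.split(",")[0].strip()  # execute the first step of the current plan


def make_toy_env() -> EnvStep:
    state = {"has_key": False}

    def step(action: str) -> Tuple[str, bool]:
        if action == "find key":
            state["has_key"] = True
            return "key found", False
        if action == "open door":
            return ("door open", True) if state["has_key"] else ("door locked", False)
        return "nothing happens", False

    return step


if __name__ == "__main__":
    trace = run_adaplan("open the door", toy_planner, toy_executor, make_toy_env())
    for plan, action, obs in zip(trace.plans, trace.actions, trace.observations):
        print(f"plan={plan!r} action={action!r} observation={obs!r}")
```

In this toy episode the first attempt fails because the door is locked, the planner revises the plan to fetch the key first, and the executor then completes the task. A ReAct-style loop would correspond to dropping the explicit `plan` argument and choosing each action from the observations alone.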