
Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks

Lutfi Eren Erdogan, Nicholas Lee, Sehoon Kim, Suhong Moon, Hiroki Furuta, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami

TL;DR

The paper addresses the challenge of long-horizon task planning by introducing Plan-and-Act, a planner-executor framework that explicitly separates high-level reasoning from low-level execution. It leverages a scalable synthetic data pipeline to train the Planner, and integrates dynamic replanning and chain-of-thought reasoning to improve robustness and generalization in web-navigation tasks. Empirical results on WebArena-Lite and WebVoyager demonstrate state-of-the-art or competitive performance, highlighting the value of planning-driven architectures and large-scale synthetic data for training open models. The approach offers a scalable, modular path to more reliable language-based agents in dynamic digital environments, with potential applicability beyond web navigation.

Abstract

Large language models (LLMs) have shown remarkable advancements in enabling language agents to tackle simple tasks. However, applying them to complex, multi-step, long-horizon tasks remains a challenge. Recent work has found success by separating high-level planning from low-level execution, which enables the model to effectively balance high-level planning objectives and low-level execution details. However, generating accurate plans remains difficult since LLMs are not inherently trained for this task. To address this, we propose Plan-and-Act, a novel framework that incorporates explicit planning into LLM-based agents and introduces a scalable synthetic data generation method to enhance plan generation. Plan-and-Act consists of a Planner model, which generates structured, high-level plans to achieve user goals, and an Executor model, which translates these plans into environment-specific actions. To train the Planner effectively, we introduce a synthetic data generation method that annotates ground-truth trajectories with feasible plans, augmented with diverse and extensive examples to enhance generalization. We evaluate Plan-and-Act using web navigation as a representative long-horizon planning environment, demonstrating a state-of-the-art 57.58% success rate on the WebArena-Lite benchmark as well as a text-only state-of-the-art 81.36% success rate on WebVoyager.
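
The Planner/Executor split described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `run_episode`, the callable signatures, and the `max_steps` cap are all hypothetical, and in the paper's setting the planner and executor would be LLM calls rather than plain functions. The sketch assumes the Planner is asked for a fresh plan after every action, which is one simple way to realize the dynamic replanning mentioned in the TL;DR.

```python
from typing import Callable, List

# Hypothetical interfaces, chosen only for illustration.
Planner = Callable[[str, List[str]], List[str]]  # (goal, history) -> remaining plan steps
Executor = Callable[[str, str], str]             # (current plan step, observation) -> action
EnvStep = Callable[[str], str]                   # action -> new observation


def run_episode(
    goal: str,
    planner: Planner,
    executor: Executor,
    env_step: EnvStep,
    first_observation: str,
    max_steps: int = 10,
) -> List[str]:
    """Run a plan-then-act loop and return a log of (action, observation) strings."""
    history: List[str] = []
    observation = first_observation
    for _ in range(max_steps):
        # High-level reasoning: replan from the goal and everything seen so far.
        plan = planner(goal, history)
        if not plan:
            break  # An empty plan is treated as "task finished".
        # Low-level execution: turn the next plan step into a concrete action.
        action = executor(plan[0], observation)
        observation = env_step(action)
        history.append(f"{action} -> {observation}")
    return history
```

Keeping the two roles behind separate callables means either one can be swapped out (for example, a fine-tuned Planner with an off-the-shelf Executor) without touching the loop itself.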

Paper Structure

This paper contains 56 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: An illustration of the Plan-and-Act system. First, the Planner LLM processes the initial user query and generates an initial step-by-step plan (Section \ref{sec:planner}). This plan is then passed to the Executor LLM, which uses it to generate actions that interact with its environment. The environment feedback is then fed back to the Executor so it can generate subsequent actions and, when a new plan needs to be generated, to the Planner. Existing methods have shown that this separation of high-level planning and low-level execution can improve accuracy. However, a major challenge is that LLMs are not generally trained to generate such plans or low-level actions, a problem that we focus on solving in this paper.
  • Figure 2: Plan-and-Act System Diagram. Given the initial user query, the Planner (Section \ref{sec:planner}) breaks it down into a high-level plan, which is given to the Executor (Section \ref{sec:executor}), which uses the plan to guide its actions. Once an action has been taken and the HTML changes, the Planner dynamically generates a new plan that incorporates the changes in the environment (Section \ref{sec:replanning}).
  • Figure 3: Synthetic Data Generation Pipeline. In the Action Trajectory Generation stage (Section \ref{sec:synthetic_trajectory_generation}), user queries from the training data are given to a Teacher LLM, which outputs synthetic user instructions. From there, a demonstrator actor LLM attempts to execute each query on the webpage. After the trajectory is finished, an ORM LLM is used to filter for successful trajectories. In the Grounded Plan Generation stage (Section \ref{sec:planner_annotation}), a Teacher LLM takes the trajectory, creates a synthetic high-level plan, and grounds each step with explicit actions in the trajectory. In the Synthetic Plan Expansion stage (Section \ref{sec:plan_augmentation}), plans from the training data are sampled and given to the Teacher LLM, which generates new synthetic plans.
  • Figure 4: Task performance metrics by website.
  • Figure 5: Model hyperparameters for training and inference.
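
The three stages named in the Figure 3 caption can be read as a simple data flow, sketched below in Python. This is an assumption-laden illustration, not the paper's pipeline code: `build_planner_dataset` and every callable parameter (`make_queries`, `demonstrate`, `is_success`, `annotate_plan`, `expand`) are hypothetical stand-ins for the Teacher, demonstrator, and ORM LLM calls, and the dictionary layout of an example is likewise invented for the sketch.

```python
from typing import Any, Callable, Dict, List, Tuple

Example = Dict[str, Any]  # hypothetical record: {"query": str, "plan": ...}


def build_planner_dataset(
    seed_queries: List[str],
    make_queries: Callable[[str], List[str]],         # Teacher: seed query -> synthetic queries
    demonstrate: Callable[[str], List[str]],          # demonstrator: query -> action trajectory
    is_success: Callable[[str, List[str]], bool],     # ORM-style judge: keep or drop a trajectory
    annotate_plan: Callable[[str, List[str]], Any],   # Teacher: trajectory -> grounded plan
    expand: Callable[[Example], List[Example]],       # Teacher: example -> new synthetic examples
) -> List[Example]:
    # Stage 1: Action Trajectory Generation, keeping only successful trajectories.
    trajectories: List[Tuple[str, List[str]]] = []
    for seed in seed_queries:
        for query in make_queries(seed):
            actions = demonstrate(query)
            if is_success(query, actions):
                trajectories.append((query, actions))

    # Stage 2: Grounded Plan Generation, one plan per surviving trajectory.
    grounded = [{"query": q, "plan": annotate_plan(q, a)} for q, a in trajectories]

    # Stage 3: Synthetic Plan Expansion, generating new examples from existing ones.
    expanded = [new for example in grounded for new in expand(example)]

    return grounded + expanded
```

Filtering before annotation, as the caption describes, means the Teacher only ever writes plans for trajectories that actually reached the goal, so every plan step in stage 2 can point at actions that are known to have worked.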