One STEP at a time: Language Agents are Stepwise Planners

Minh Nguyen, Ehsan Shareghi

TL;DR

STEP, a novel framework designed to efficiently learn from previous experiences to enhance the planning capabilities of language agents in future steps, consistently outperforms state-of-the-art models in the ScienceWorld benchmark.

Abstract

Language agents have shown promising adaptability in dynamic environments to perform complex tasks. However, despite the versatile knowledge embedded in large language models, these agents still fall short when it comes to tasks that require planning. We introduce STEP, a novel framework designed to efficiently learn from previous experiences to enhance the planning capabilities of language agents in future steps. Concretely, STEP functions through four interconnected components. First, the Planner takes on the task, breaks it down into subtasks and provides relevant insights. Then the Executor generates action candidates, while the Evaluator ensures the actions align with learned rules from previous experiences. Lastly, Memory stores experiences to inform future decisions. In the ScienceWorld benchmark, our results show that STEP consistently outperforms state-of-the-art models, achieving an overall score of 67.4 and successfully completing 12 out of 18 tasks. These findings highlight STEP's potential as a framework for enhancing planning capabilities in language agents, paving the way for more sophisticated task-solving in dynamic environments.
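The four-component loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name below (`Memory`, `planner`, `executor`, `evaluator`, `run_episode`) is hypothetical, the language-model calls are replaced by simple string rules, and a "learned rule" is assumed to be nothing more than an action prefix to avoid.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Toy experience store. A 'rule' is an action prefix to avoid (assumption)."""
    rules: list = field(default_factory=list)

    def store(self, rule: str) -> None:
        # Keep each learned rule only once.
        if rule not in self.rules:
            self.rules.append(rule)


def planner(task: str, memory: Memory):
    """Break the task into subtasks and surface insights from memory.

    Assumption: subtasks are separated by the word 'then'; a real planner
    would query a language model instead.
    """
    subtasks = [part.strip() for part in task.split(" then ")]
    insights = list(memory.rules)
    return subtasks, insights


def executor(subtask: str, insights: list) -> list:
    """Propose action candidates for one subtask.

    The toy version ignores the insights; a real executor would condition
    its generation on them.
    """
    return [f"skip {subtask}", f"do {subtask}"]


def evaluator(candidates: list, memory: Memory) -> str:
    """Return the first candidate that does not violate a learned rule."""
    for candidate in candidates:
        if not any(candidate.startswith(rule) for rule in memory.rules):
            return candidate
    # Fall back to the last candidate if every option violates a rule.
    return candidates[-1]


def run_episode(task: str, memory: Memory) -> list:
    """Run Planner -> Executor -> Evaluator per subtask, updating Memory."""
    subtasks, insights = planner(task, memory)
    trace = []
    for subtask in subtasks:
        action = evaluator(executor(subtask, insights), memory)
        trace.append(action)
        # Toy feedback signal: skipping a subtask counts as a mistake,
        # so the agent records a rule against it for future steps.
        if action.startswith("skip"):
            memory.store("skip")
    return trace
```

Calling `run_episode` twice with the same `Memory` instance shows the intended effect in miniature: a rule stored after an early mistake filters the candidates chosen in later steps and later episodes.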

Paper Structure

This paper contains 26 sections, 7 figures, 3 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Comparison of STEP with SOTA.
  • Figure 2: Planning methods. (LEFT) plan-from-the-start. (RIGHT) plan-on-the-go.
  • Figure 3: The architecture of STEP. (1) Planner receives the task and generates achievable subtasks and relevant insights, (2) Executor creates action candidates based on the generated subtasks and insights, (3) Evaluator checks that these action candidates align with rules learned from previous experiences, and (4) Memory stores the experience for future use.
  • Figure 4: Performance example of STEP in Long and Short Tasks.
  • Figure 5: Task performance comparison between STEP and CLIN.
  • ...and 2 more figures