Interactive and Expressive Code-Augmented Planning with Large Language Models

Anthony Z. Liu, Xinhe Wang, Jacob Sansom, Yao Fu, Jongwook Choi, Sungryull Sohn, Jaekyeom Kim, Honglak Lee

TL;DR

In REPL-Plan, an LLM solves tasks by interacting with a Read-Eval-Print Loop (REPL), which iteratively executes and evaluates code, similar to language shells or interactive code notebooks, allowing the model to flexibly correct errors and handle tasks dynamically.

Abstract

Large Language Models (LLMs) demonstrate strong abilities in common-sense reasoning and interactive decision-making, but often struggle with complex, long-horizon planning tasks. Recent techniques have sought to structure LLM outputs using control flow and other code-adjacent techniques to improve planning performance. These techniques include using variables (to track important information) and functions (to divide complex tasks into smaller reusable sub-tasks). However, purely code-based approaches can be error-prone and insufficient for handling ambiguous or unstructured data. To address these challenges, we propose REPL-Plan, an LLM planning approach that is fully code-expressive (it can utilize all the benefits of code) while also being dynamic (it can flexibly recover from errors and use the LLM for fuzzy situations). In REPL-Plan, an LLM solves tasks by interacting with a Read-Eval-Print Loop (REPL), which iteratively executes and evaluates code, similar to language shells or interactive code notebooks, allowing the model to flexibly correct errors and handle tasks dynamically. We demonstrate that REPL-Plan achieves strong results across various planning domains compared to previous methods.

Paper Structure

This paper contains 39 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: REPL-Plan is an approach for augmenting LLM planning by using LLMs to interact with LLM-REPLs, which are an extension of REPLs (Read-Eval-Print Loops, e.g. language shells, code notebooks). [A] Observations are printed out in the main LLM-REPL and all child LLM-REPLs. [B][C] The LLM, which outputs code in the LLM-REPL line by line, interacts with the environment by calling the act function. Each LLM-REPL (including each child) can act directly on the environment. [D] The LLM can "spawn" child LLM-REPLs by calling undefined functions. These functions can be used to abstract various parts of the task (e.g. finding one of the necessary objects, interpreting whether the current location is closed, etc.). [E] Arguments and return values are passed between parent and child LLM-REPLs using the get_args and answer functions. [F] LLM-REPLs can be called multiple times, and each call continues output from the last answer statement. This allows an LLM to produce consistent and correct outputs. [G] An LLM can combine these tools (child LLM-REPLs, code control flow) to express complex workflows that solve tasks compactly.
  • Figure 2: A toy example of the context passing that is possible in REPL-Plan. In the toy example, the task is to parse all items on a search page that match a given description. We show sample generated code in the two code snippets above, where the agent splits the task into 3 different sub-tasks: (1) filter_page, parsing any matching items on the current page, (2) parse_items, parsing any item links on the current page, and (3) item_matches, determining whether the current item page matches the description.
  • Figure 3: For a qualitative analysis, we include truncated versions of trajectories from REPL-Plan and the baseline ReACT on a real-world web loop-like task. In this task (\ref{['sec:rww-details']}), the agent is shown long web pages (4k-15k tokens long), and must interact with the page using element IDs (labeled with integer IDs). In the trajectories, (1) REPL-Plan is able to manage large observations and long prompt contexts by sub-dividing the task into different LLM-REPLs, and (2) both ReACT and REPL-Plan run into hallucination errors from GPT4o-mini (highlighted in red). On the left, in ReACT, the LLM gets "lost" and re-checks a product it has already checked. This causes the ReACT agent to loop infinitely. On the right, in REPL-Plan, the agent hallucinates a link element ID. However, the code in the main loop mitigates the effect of the hallucination: the agent clicks the wrong element ID, but still continues to search for candidate products.
  • Figure 4: Another toy example of the context passing that is possible in REPL-Plan, where context is "interleaved" between LLM-REPLs. In the toy example, the task is to count to 4. We show the generated code in the two code snippets above. In the code, the main REPL can spawn another REPL to help it count only the even numbers. By passing context back and forth, the final actions count to 4, as shown in the REPL's execution trace in the bottom table.
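
The interleaved context passing described for Figures 1 and 4 can be illustrated with a small sketch. This is not the paper's implementation: the `ChildREPL` class, the `count_even` body, and the in-memory `actions` log are all hypothetical names introduced here, the child's code is hand-written rather than produced by an LLM, and the child is created explicitly instead of being spawned by a call to an undefined function. The sketch only models one property from the captions, namely that a child context persists across calls and resumes from its last answer point, which a Python generator mimics closely.

```python
from typing import Callable, Generator, List

# Hypothetical stand-in for the environment: we only record actions.
actions: List[str] = []


def act(action: str) -> None:
    """Assumed analogue of the act function; appends to an action log."""
    actions.append(action)


class ChildREPL:
    """A child context that persists across calls, modeled as a generator.

    Sending arguments in plays the role of get_args; the value yielded
    back plays the role of answer. The child then pauses at that point
    until the parent calls it again.
    """

    def __init__(self, body: Callable[[], Generator]):
        self._gen = body()
        next(self._gen)  # run the child up to its first get_args point

    def __call__(self, *args):
        return self._gen.send(args)


def count_even() -> Generator:
    """Hand-written child body (an assumption; an LLM would emit this)."""
    args = yield  # get_args: wait for the first call
    while True:
        (n,) = args
        act(str(n))  # the child acts on the environment directly
        args = yield n  # answer(n), then wait for the next call


def main() -> None:
    even = ChildREPL(count_even)  # explicit spawn in this sketch
    for n in range(1, 5):
        if n % 2 == 0:
            even(n)  # context passes to the child and back
        else:
            act(str(n))  # the main context acts on odd numbers


main()
```

After `main()` runs, `actions` holds the counts contributed alternately by the main context and the child, in order. The generator keeps its local state between calls, which is the property the captions attribute to LLM-REPLs that "continue output from the last answer statement".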