Scalable Task Planning via Large Language Models and Structured World Representations

Rodrigo Pérez-Dattari; Zhaoting Li; Robert Babuška; Jens Kober; Cosimo Della Santina

TL;DR

This paper addresses the intractability of task planning in large environments by combining a graph-based world model with LLM-driven pruning. The state space is reduced before planning in two LLM-guided steps: (i) a taxonomy-guided, environment-agnostic object selection that narrows the set of relevant objects, and (ii) a relationship-based refinement that accounts for environment-specific interactions. Both steps are grounded in a graph state representation $S=(O,R)$. The authors report that planning on the pruned state graph, with either search-based or LLM-based policies, achieves high success rates in VirtualHome and transfers to real-world tasks with a 7-DoF manipulator, with GPT-4o consistently outperforming GPT-3.5. The approach is presented as a zero-shot way to scale robotic task planning to thousands of objects without retraining, validated in both simulation and real-system experiments.
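To make the idea of pruning a graph state $S=(O,R)$ concrete, here is a minimal Python sketch. It is not the authors' implementation: the class names `Relation` and `GraphState`, the `prune` method, and the toy household objects are all illustrative assumptions. The sketch only shows what a reduced graph looks like once some set of relevant objects has been chosen (in the paper, that choice is made by the LLM).

```python
from dataclasses import dataclass, field
from typing import Iterable, Set


@dataclass(frozen=True)
class Relation:
    """A directed relationship between two objects, e.g. apple -inside-> fridge."""
    subject: str
    predicate: str
    target: str


@dataclass
class GraphState:
    """Hypothetical container for a graph state S = (O, R)."""
    objects: Set[str] = field(default_factory=set)
    relations: Set[Relation] = field(default_factory=set)

    def prune(self, keep: Iterable[str]) -> "GraphState":
        """Return the subgraph induced by the objects in `keep`."""
        kept = set(keep) & self.objects
        kept_relations = {
            r for r in self.relations
            if r.subject in kept and r.target in kept
        }
        return GraphState(kept, kept_relations)


# Toy environment (assumed for illustration only).
state = GraphState(
    objects={"apple", "fridge", "kitchen", "tv", "sofa"},
    relations={
        Relation("apple", "inside", "fridge"),
        Relation("fridge", "inside", "kitchen"),
        Relation("tv", "facing", "sofa"),
    },
)

# Suppose the selection steps marked only these two objects as relevant.
reduced = state.prune({"apple", "fridge"})
print(sorted(reduced.objects))
```

A planner would then search over `reduced` instead of `state`, which is where the reduction in problem size comes from.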

Abstract

Planning methods struggle with computational intractability when solving task-level problems in large-scale environments. This work explores leveraging the commonsense knowledge encoded in LLMs to enable planning techniques to handle these complex scenarios. We achieve this by using LLMs to efficiently prune irrelevant components from the planning problem's state space, substantially reducing its complexity. We demonstrate the efficacy of this system through extensive experiments in a household simulation environment, alongside real-world validation using a 7-DoF manipulator (video: https://youtu.be/6ro2UOtOQS4).

Paper Structure (26 sections, 6 equations, 6 figures, 2 tables, 2 algorithms)

Introduction
Related Work
Preliminaries
Task Planning
Graph-based State Representation
Problem Formulation
Method
Step 1: Environment-agnostic Object Selection
Taxonomy Graph States
Interacting with the taxonomy
Step 2: Relationship-based Object Selection
Iterative selection process
Integrating Taxonomy Graph States with LLMs
Planning with State Graphs
Search-based Policies
...and 11 more sections

Figures (6)

Figure 1: Summary of the proposed framework. A taxonomy of object classes, where the lowest level represents objects (e.g., a computer) and the higher levels represent groups of objects (e.g., electronics), is combined with knowledge of the environment to create a state graph that indicates attributes of objects (e.g., A is open) and relationships between them (e.g., A is inside B). For a given task goal, an LLM uses the object taxonomy, the state graph, and its commonsense knowledge to derive a reduced graph that contains only the objects necessary to solve the task planning problem.
Figure 2: Example of the proposed method. Step 1 ($LLM^{\mathcal{T}}$): Relevant objects are selected from an object taxonomy $C$. At the highest hierarchical level ($\ell=1$), three object categories are provided to the LLM, which selects two as relevant ($\bar{C}_{\ell}$). The child nodes of $\bar{C}_{\ell}$ ($\hat{C}_{\ell+1}$), obtained via $\Psi$, are then provided to the LLM to further select the relevant ones ($\bar{C}_{\ell+1}$). Step 2 ($LLM^{\mathcal{R}}$): Relevant relationships are iteratively selected using the graph state $S$. In the first iteration ($i=0$), the objects obtained from Step 1 ($\bar{O}^{\mathcal{T}}$) are fed into the function $\Phi$, which locally expands the graph to identify objects interacting with $\bar{O}^{\mathcal{T}}$. One object is identified ($S^{\mathcal{E}}_{i}$). Subsequently, the LLM determines that the interacting object is relevant. A second iteration starts ($i=1$), where the new graph $\bar{O}_{i}^{\mathcal{T}\cup\mathcal{R}}$ is expanded with interacting objects $S^{\mathcal{E}}_{i+1}$. However, none of these objects are deemed relevant by the LLM, concluding the object selection process.
Figure 3: Example of a VirtualHome environment: The agent can navigate the house and interact with objects in multiple rooms. Six environments were employed, with the number of objects the agent could interact with ranging from 221 to 348. These objects have properties such as being openable, grabbable, or switchable. Image extracted from http://virtual-home.org/.
Figure 4: Performance comparison of state space size reduction using GPT-3.5 and GPT-4o for tasks containing one to five subtasks. "Objects missing" indicates that some objects necessary for the task are not selected, while "format error" means the LLM's output does not adhere to the required format.
Figure 5: Example of a real-world execution of the proposed method. Objective: put the tomatoes inside the crate, the rotten tomatoes (blue ball) in the bin (box in the corner of the table) and press the button once you finish.
...and 1 more figure
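
The two-step selection described in the Figure 2 caption can be sketched as follows. This is a hedged illustration, not the paper's code: the LLM is replaced by a hypothetical `select` callable, `children` plays the role of $\Psi$, `expand` plays the role of $\Phi$, and the `max_iters` cap, the toy taxonomy, and the assumption that taxonomy leaves share names with environment objects are all introduced here for the example.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Hypothetical stand-in for an LLM call: (goal, options) -> relevant options.
Selector = Callable[[str, List[str]], List[str]]
Triple = Tuple[str, str, str]  # (subject, predicate, target)


def select_from_taxonomy(goal: str, roots: List[str],
                         children: Dict[str, List[str]],
                         select: Selector) -> Set[str]:
    """Step 1: descend the taxonomy level by level, keeping relevant nodes."""
    candidates, leaves = list(roots), set()
    while candidates:
        # Ignore anything the selector returns that was not offered.
        relevant = [c for c in select(goal, candidates) if c in candidates]
        next_level: List[str] = []
        for node in relevant:
            kids = children.get(node, [])  # role of Psi: child nodes
            if kids:
                next_level.extend(kids)
            else:
                leaves.add(node)  # lowest level: an object class
        candidates = next_level
    return leaves


def expand(selected: Set[str], relations: Iterable[Triple]) -> Set[str]:
    """Role of Phi: unselected objects that interact with a selected one."""
    neighbours = set()
    for a, _, b in relations:
        if a in selected and b not in selected:
            neighbours.add(b)
        elif b in selected and a not in selected:
            neighbours.add(a)
    return neighbours


def select_by_relations(goal: str, selected: Set[str],
                        relations: List[Triple], select: Selector,
                        max_iters: int = 10) -> Set[str]:
    """Step 2: iteratively add interacting objects deemed relevant."""
    selected = set(selected)
    for _ in range(max_iters):
        frontier = sorted(expand(selected, relations))
        added = set(select(goal, frontier)) & set(frontier) if frontier else set()
        if not added:
            break  # nothing new is relevant, so selection ends
        selected |= added
    return selected


# Toy data and a keyword-based selector (assumptions, not an LLM).
def toy_select(goal: str, options: List[str]) -> List[str]:
    return [o for o in options if o in {"food", "apple", "fridge"}]


children = {"food": ["apple", "bread"], "electronics": ["tv"], "furniture": ["sofa"]}
relations = [("apple", "inside", "fridge"), ("fridge", "inside", "kitchen"),
             ("tv", "facing", "sofa")]

goal = "grab the apple"
step1 = select_from_taxonomy(goal, ["food", "electronics", "furniture"], children, toy_select)
step2 = select_by_relations(goal, step1, relations, toy_select)
print(sorted(step1), sorted(step2))
```

In this toy run the first relationship iteration adds one interacting object and the second adds none, mirroring the stopping pattern described in the caption.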
