Large-Language-Model-Guided State Estimation for Partially Observable Task and Motion Planning

Yoonwoo Kim, Raghav Arora, Roberto Martín-Martín, Peter Stone, Ben Abbatematteo, Yoonchang Sung

TL;DR

The planning and execution framework, CoCo-TAMP, introduces a hierarchical state estimation that uses LLM-guided information to shape the belief over task-relevant objects, enabling efficient solutions to long-horizon task and motion planning problems.

Abstract

Robot planning in partially observable environments, where not all objects are known or visible, is a challenging problem, as it requires reasoning under uncertainty through partially observable Markov decision processes. During the execution of a computed plan, a robot may unexpectedly observe task-irrelevant objects, which are typically ignored by naive planners. In this work, we propose incorporating two types of common-sense knowledge: (1) certain objects are more likely to be found in specific locations; and (2) similar objects are likely to be co-located, while dissimilar objects are less likely to be found together. Manually engineering such knowledge is complex, so we explore leveraging the powerful common-sense reasoning capabilities of large language models (LLMs). Our planning and execution framework, CoCo-TAMP, introduces a hierarchical state estimation that uses LLM-guided information to shape the belief over task-relevant objects, enabling efficient solutions to long-horizon task and motion planning problems. In experiments, CoCo-TAMP achieves an average reduction of 62.7% in planning and execution time in simulation, and 72.6% in real-world demonstrations, compared to a baseline that does not incorporate either type of common-sense knowledge.
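To make the first kind of common-sense knowledge concrete, the following is a minimal sketch of how a prior over rooms could shape a belief and then be revised after the robot looks in a room without finding the object. The room names, the prior scores, and the detection probability `p_detect` are hypothetical, and the update is a generic discrete Bayes rule, not necessarily the observation model used in CoCo-TAMP.

```python
def normalize(scores):
    """Scale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have positive total mass")
    return {k: v / total for k, v in scores.items()}


def update_after_miss(belief, looked_in, p_detect=0.9):
    """Bayes update after looking in one room and NOT seeing the object.

    p_detect is an assumed probability of seeing the object when it is
    actually in the observed room.
    """
    posterior = {}
    for room, prob in belief.items():
        # A miss is unlikely if the object is in the observed room,
        # and certain if the object is somewhere else.
        likelihood = (1.0 - p_detect) if room == looked_in else 1.0
        posterior[room] = prob * likelihood
    return normalize(posterior)


# Hypothetical scores standing in for an LLM's answer about a mug.
prior = normalize({"kitchen": 6.0, "living_room": 3.0, "bathroom": 1.0})
posterior = update_after_miss(prior, looked_in="kitchen")
best_room = max(posterior, key=posterior.get)
```

In this toy setup, a failed look in the most likely room moves most of the probability mass to the next most plausible room, which is the effect an informed prior is meant to have on where the robot searches next.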

Paper Structure (24 sections, 18 equations, 7 figures, 3 tables, 1 algorithm)

Figures (7)

  • Figure 1: The initial beliefs about the semantic locations of objects, $\text{bel}(x_{r,0}^k)$ and $\text{bel}(x_{s,0}^k)$, are derived from LLMs, while the initial beliefs about their poses, $\text{bel}(x_{p,0}^k)$, are uniformly distributed across all surfaces. The TAMP problem specification $(\mathcal{O, P, I, G, A})$, where the beliefs are incorporated into the cost of the observation action, is provided to the TAMP planner PDDLStream. The planner outputs a plan, which the robot executes. Upon executing the observation action, the co-location toggler determines whether to use the co-location model, based on the observed object. Then, the beliefs are updated using the proposed observation model, and the planner replans with the updated beliefs. Planning and execution are repeated until the goal state $\mathcal{G}$ is reached. Further implementation details, derivations, additional experimental results, and a link to the code are available at our project page: https://coco-tamp.github.io.
  • Figure 2: Example of a simulated household environment.
  • Figure 3: Average cumulative planning and execution time over 50 environments in a household layout with 6 rooms and 12 surfaces, with 95% confidence intervals.
  • Figure 4: Pairwise comparison of the number of replans required to complete the task, relative to MCQA with the co-location model, with 95% confidence intervals.
  • Figure 5: Performance metrics over 50 environments in a household layout (6 rooms, 12 surfaces), comparing different LLMs for the MCQA setting. The plots show the number of replans with 95% confidence intervals. GPT-4o outperforms the smaller models, so we use GPT-4o in all other experiments in this paper.
  • ...and 2 more figures
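
The plan, observe, update, and replan cycle described in the Figure 1 caption can be illustrated with a toy loop. This is only a sketch under stated assumptions: the surface names and probabilities are hypothetical, `plan` is a stand-in that picks the cheapest observation rather than a call to PDDLStream, the negative-log cost is an assumed way of folding beliefs into the observation cost, and sensing is assumed to be perfect.

```python
import math


def observation_cost(prob, eps=1e-9):
    """Assumed cost: unlikely surfaces are more expensive to observe."""
    return -math.log(max(prob, eps))


def plan(belief):
    """Stand-in for the planner: pick the cheapest surface to observe."""
    return min(belief, key=lambda s: observation_cost(belief[s]))


def update(belief, surface):
    """Rule out a surface where the object was not seen, then renormalize."""
    remaining = {s: p for s, p in belief.items() if s != surface}
    total = sum(remaining.values())
    return {s: p / total for s, p in remaining.items()}


def search(belief, true_surface):
    """Plan, observe, update, and replan until the object is found."""
    visited = []
    while belief:
        target = plan(belief)
        visited.append(target)
        if target == true_surface:
            return visited
        belief = update(belief, target)
    raise RuntimeError("object not found on any surface")


# Hypothetical initial belief over three surfaces.
initial = {"counter": 0.5, "shelf": 0.3, "desk": 0.2}
order = search(initial, true_surface="desk")
num_replans = len(order) - 1
```

Because the cost depends on the belief, a better-informed initial belief puts the true surface earlier in the visiting order and so lowers the number of replans, which is the quantity reported in Figures 4 and 5.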