Kinodynamic Task and Motion Planning using VLM-guided and Interleaved Sampling

Minseo Kwon, Young J. Kim

TL;DR

This paper proposes a kinodynamic TAMP framework based on a hybrid state tree that uniformly represents symbolic and numeric states during planning, so that task and motion decisions are made jointly.

Abstract

Task and Motion Planning (TAMP) integrates high-level task planning with low-level motion feasibility, but existing methods are costly in long-horizon problems due to excessive motion sampling. While LLMs provide commonsense priors, they lack 3D spatial reasoning and cannot ensure geometric or dynamic feasibility. We propose a kinodynamic TAMP framework based on a hybrid state tree that uniformly represents symbolic and numeric states during planning, so that task and motion decisions are made jointly. Kinodynamic constraints embedded in the TAMP problem are verified by an off-the-shelf motion planner and a physics simulator, while a VLM guides the search for a TAMP solution and selects backtracking points based on visual renderings of the states. Experiments in simulated domains and in the real world show average success rates 32.14% to 1166.67% higher than those of traditional and LLM-based TAMP planners, along with reduced planning time on complex problems; ablations further highlight the benefits of VLM guidance.
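
To make the idea of a hybrid state more concrete, here is a minimal Python sketch of a tree node that stores symbolic facts and numeric object poses side by side. This is an illustration only, not the authors' data structure: the class name `HybridNode`, the position-only `Pose` type, and the string-valued actions are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional, Tuple

Fact = Tuple[str, ...]              # symbolic fact, e.g. ("on", "a", "table")
Pose = Tuple[float, float, float]   # hypothetical numeric state: x, y, z only


@dataclass(eq=False)
class HybridNode:
    """One node of a hybrid state tree (illustrative, not the paper's code)."""
    facts: FrozenSet[Fact]                      # symbolic part of the state
    poses: Dict[str, Pose]                      # numeric part of the state
    parent: Optional["HybridNode"] = field(default=None, repr=False)
    action: Optional[str] = None                # action on the incoming edge
    children: List["HybridNode"] = field(default_factory=list, repr=False)

    def add_child(self, facts, poses, action) -> "HybridNode":
        """Attach a successor state reached by applying `action`."""
        child = HybridNode(frozenset(facts), dict(poses), self, action)
        self.children.append(child)
        return child

    def plan(self) -> List[str]:
        """Return the actions on the path from the root to this node."""
        actions = []
        node = self
        while node.parent is not None:
            actions.append(node.action)
            node = node.parent
        return list(reversed(actions))


root = HybridNode(frozenset({("on", "a", "table")}), {"a": (0.0, 0.0, 0.0)})
held = root.add_child({("holding", "a")}, {"a": (0.0, 0.0, 0.3)}, "pick a")
done = held.add_child({("on", "a", "b")}, {"a": (0.2, 0.0, 0.1)}, "place a b")
print(done.plan())
```

Because each node carries both parts of the state, a single path through the tree fixes the symbolic action sequence and the numeric values chosen along it at the same time.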

Paper Structure

This paper contains 20 sections, 2 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: An overview of our approach. Given a problem PDDL and a domain PDDL, a top-$k$ symbolic planner generates a discrete state graph $G$. Guided by $G$, we expand a hybrid state tree $T$ where each edge is expanded by motion planning and validated by physics simulation. If a node $h_t$ fails to expand, we retry random sampling up to $K$ times; if still unsuccessful, we prompt the VLM to predict a backtrack node $h_r$, from which expansion resumes. This process repeats until a goal is found or the timeout is reached.
  • Figure 2: Blocksworld domain setup with a PR2 robot. We only enable the left arm for this task. Initially, $n=6$ blocks are randomly placed on a table (left), and the goal is to restack them in a new order (right).
  • Figure 3: Kitchen domain setup with a KUKA IIWA robot. Initially, six food objects are surrounded by 12 distractors (left), and the goal is to cook the food on the stove (right). The distractors are fixed objects.
  • Figure 4: Robotic demonstration of our TAMP planner in the Blocksworld domain ($n=6$). The initial configuration consists of six blocks stacked on the table (leftmost image), and the goal is to rearrange them into the stacking sequence shown in 14. Only the first four and last four actions, $\pi = \{a_1, a_2, a_3, a_4, \cdots, a_{11}, a_{12}, a_{13}, a_{14}\}$, are shown here.
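
The loop described in the Figure 1 caption (expand an edge, retry random sampling up to $K$ times, then backtrack to a predicted node $h_r$) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the state graph $G$ is reduced to a single symbolic plan, `try_motion` is a hypothetical stand-in for the motion planner plus physics simulation, `pick_backtrack` is a hypothetical stand-in for the VLM query, and an iteration budget replaces the timeout.

```python
import random
from typing import Callable, List, Optional


class Node:
    """Tree node: symbolic depth plus the numeric sample on its incoming edge."""
    def __init__(self, depth: int, sample: Optional[float] = None):
        self.depth = depth
        self.sample = sample


def expand_tree(
    plan: List[str],
    try_motion: Callable[[List[Node], str, float], bool],  # assumed interface
    pick_backtrack: Callable[[List[Node]], int],           # assumed interface
    K: int = 3,
    max_iters: int = 100,
    seed: int = 0,
) -> Optional[List[float]]:
    rng = random.Random(seed)
    path = [Node(0)]                      # root-to-current path; path[-1] is h_t
    for _ in range(max_iters):            # iteration budget stands in for timeout
        node = path[-1]
        if node.depth == len(plan):       # every symbolic action is realized
            return [n.sample for n in path[1:]]
        action = plan[node.depth]
        child = None
        for _ in range(K):                # retry random sampling up to K times
            sample = rng.random()
            if try_motion(path, action, sample):
                child = Node(node.depth + 1, sample)
                break
        if child is not None:
            path.append(child)
        else:
            r = pick_backtrack(path)      # index of the backtrack node h_r
            r = max(0, min(r, len(path) - 1))
            del path[r + 1:]              # resume expansion from h_r
    return None


def feasible(path, action, sample):
    # Toy constraint: "place" only works if the earlier "pick" sample was < 0.5.
    return action == "pick" or path[-1].sample < 0.5


result = expand_tree(["pick", "place"], feasible, lambda path: 0)
print(result)
```

In the toy run, a poor "pick" sample makes every "place" attempt fail, so the search returns to the root and resamples the pick, which mirrors how an early numeric choice can force backtracking later in the plan.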