Guiding Long-Horizon Task and Motion Planning with Vision Language Models

Zhutian Yang; Caelan Garrett; Dieter Fox; Tomás Lozano-Pérez; Leslie Pack Kaelbling

TL;DR

This work tackles long-horizon robotic manipulation where Vision-Language Models (VLMs) lack reliable geometric reasoning. It introduces VLM-TAMP, a hierarchical planner that uses a VLM to generate semantically meaningful subgoals and a Task and Motion Planner (TAMP) to ground them into feasible trajectories, with replanning when necessary. The approach is evaluated on kitchen-cooking tasks requiring 30–50 actions and up to 21 objects, showing substantial improvements in success rates and task completion over baselines that simply execute VLM-generated actions. The findings demonstrate that subgoal-based prompting, coupled with iterative TAMP grounding and VLM reprompting, effectively bridges high-level reasoning and low-level feasibility for long-horizon manipulation.

Abstract

Vision-Language Models (VLMs) can generate plausible high-level plans when prompted with a goal, the context, an image of the scene, and any planning constraints. However, there is no guarantee that the predicted actions are geometrically and kinematically feasible for a particular robot embodiment. As a result, many prerequisite steps, such as opening drawers to access objects, are often omitted from their plans. Robot task and motion planners can generate motion trajectories that respect the geometric feasibility of actions and insert physically necessary actions, but they do not scale to everyday problems that require common-sense knowledge and involve large state spaces comprised of many variables. We propose VLM-TAMP, a hierarchical planning algorithm that leverages a VLM to generate both semantically meaningful and horizon-reducing intermediate subgoals that guide a task and motion planner. When a subgoal or action cannot be refined, the VLM is queried again for replanning. We evaluate VLM-TAMP on kitchen tasks where a robot must accomplish cooking goals that require performing 30-50 actions in sequence and interacting with up to 21 objects. VLM-TAMP substantially outperforms baselines that rigidly and independently execute VLM-generated action sequences, both in terms of success rates (50 to 100% versus 0%) and average task completion percentage (72 to 100% versus 15 to 45%). See project site https://zt-yang.github.io/vlm-tamp-robot/ for more information.
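
The loop described in the abstract (the VLM proposes subgoals, TAMP refines each one, and the VLM is reprompted on failure) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `vlm_tamp`, `toy_vlm`, `toy_tamp`, `TOY_PLANS`, the string-based subgoals, and the `max_reprompts` parameter are all hypothetical names chosen for illustration, and the VLM and the planner are replaced by simple stubs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PlanResult:
    completed: List[str] = field(default_factory=list)  # subgoals refined so far
    actions: List[str] = field(default_factory=list)    # concatenated action plan
    success: bool = False


def vlm_tamp(goal, state, query_vlm, solve_tamp, max_reprompts=2):
    """Hypothetical sketch: refine VLM subgoals one by one, reprompt on failure."""
    result = PlanResult()
    failed: Optional[str] = None
    for _ in range(max_reprompts + 1):
        # Ask the (stubbed) VLM for subgoals, passing along any failure feedback.
        subgoals = query_vlm(goal, state, failed)
        failed = None
        for subgoal in subgoals:
            outcome = solve_tamp(state, subgoal)
            if outcome is None:          # the planner could not refine this subgoal
                failed = subgoal
                break
            plan, state = outcome
            result.actions.extend(plan)
            result.completed.append(subgoal)
        if failed is None:               # every proposed subgoal was refined
            result.success = True
            break
    return result


# Assumption: a lookup table standing in for a real task and motion planner.
TOY_PLANS = {
    "on(pot, stove)": ["pick(pot)", "place(pot, stove)"],
    "in(cabbage, pot)": ["open(drawer)", "pick(cabbage)", "place(cabbage, pot)"],
}


def toy_tamp(state, subgoal):
    """Return (plan, new_state), or None if the subgoal cannot be refined."""
    if subgoal in state:
        return [], state
    plan = TOY_PLANS.get(subgoal)
    if plan is None:
        return None
    return plan, state | {subgoal}


def toy_vlm(goal, state, failed):
    """Stub VLM: the first answer contains an infeasible subgoal; the retry fixes it."""
    if failed is None:
        return ["on(pot, stove)", "in(cabbage, oven)"]
    return ["in(cabbage, pot)"]


result = vlm_tamp("cook the cabbage", frozenset(), toy_vlm, toy_tamp)
print(result.success, result.actions)
```

In this toy run the planner stub inserts `open(drawer)` on its own, mirroring the abstract's point that prerequisite steps are supplied by the planner rather than the VLM, and progress made before a failed subgoal is kept when the VLM is reprompted.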

Paper Structure (27 sections, 5 figures, 1 algorithm)

Introduction
Related Work
Method
Problem Formulation
Approach
Using VLM for Subgoal or Action Sequencing
Predicting Subgoals
Predicting Actions
Using TAMP to Refine Subgoals or Action Sequences
TAMP problems
Planning for Subgoals
Refining Actions
VLM Replanning after TAMP Failure
Experiments
Baselines and Ablations
...and 12 more sections

Figures (5)

Figure 2: Example trajectories of different robots achieving the same goal of having the cabbage in the pot, where the cabbage is placed in a drawer and the pot is hard to reach. While the VLM may not be able to generate feasible action plans based on text and image description of the scene, TAMP can find the shortest feasible task plans that move obstacles if necessary and respect the kinematic constraints of the robot and other articulated objects.
Figure 3: Example input images to the VLM, annotated with object names and bounding boxes. The top image marks movable objects and articulated joints, while the bottom image marks movable objects and placement surfaces.
Figure 4: Conversation Template for querying VLMs and example responses used by VLM-TAMP (ab) and baseline VLM + Motion Planning (ac). The same templates are used during reprompting, with text formatted using {Purple} representing updated information.
Figure 5: Our VLM-TAMP Algorithm. PM means Problem Manager, which formulates the next TAMP sub-problem to solve. TP means Task Planning, which checks the semantics of subgoals.
Figure 6: Our experimental results show that 1) predicting subgoals (VLM-TAMP) outperforms predicting actions, and 2) reprompting helps when predicting subgoals (VLM-TAMP) as the number of reprompt tries increases, but not when predicting actions. All six methods are run for 30 random trials on four problem difficulties, with increasing numbers of controllable robot arms and manipulable obstacles.
