Using VLM Reasoning to Constrain Task and Motion Planning

Muyang Yan, Miras Mengdibayev, Ardon Floros, Weihang Guo, Lydia E. Kavraki, Zachary Kingston

TL;DR

This work addresses the downward refinement gap in task and motion planning by using Vision-Language Models (VLMs) to infer general geometric constraints before planning. The proposed VIZ-COAST framework integrates a Visual Reasoning Module that translates scene imagery into Z3 SMT constraints, interfacing with an SMT-based task planner while leaving motion grounding to a stream-based planner like COAST. Across Blocks and Containers domains, VIZ-COAST substantially reduces or eliminates downward refinement failures and speeds up planning compared to baselines, with zero-shot generalization to unseen instances. The results demonstrate that VLM-driven preplanning can meaningfully accelerate long-horizon robotic planning, though latency and real-world validation remain important directions for future work.

Abstract

In task and motion planning, high-level task planning is done over an abstraction of the world to enable efficient search in long-horizon robotics problems. However, the feasibility of these task-level plans relies on the downward refinability of the abstraction into continuous motion. When a domain's refinability is poor, task-level plans that appear valid may ultimately fail during motion planning, requiring replanning and resulting in slower overall performance. Prior works mitigate this by encoding refinement issues as constraints to prune infeasible task plans. However, these approaches only add constraints upon refinement failure, expending significant search effort on infeasible branches. We propose VIZ-COAST, a method of leveraging the common-sense spatial reasoning of large pretrained Vision-Language Models to identify issues with downward refinement a priori, bypassing the need to fix these failures during planning. Experiments on two challenging TAMP domains show that our approach is able to extract plausible constraints from images and domain descriptions, drastically reducing planning times and, in some cases, eliminating downward refinement failures altogether, generalizing to a diverse range of instances from the broader domain.

Paper Structure

This paper contains 20 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: VIZ-COAST uses VLMs and SMT-based task planning to constrain TAMP problems. Our constraints generalize to unseen problem instances within a broader domain zero-shot. VIZ-COAST introduces minimal search overhead, enabling more efficient planning than prior state-of-the-art.
  • Figure 2: The VIZ-COAST architecture. Our Visual Reasoning Module takes as input an example scene, consisting of an image and the geometric state, the PDDL domain description, and an example of a syntactically compliant output. It produces a Python file encoding constraints through a high-level API to the SMT-based task planner. When presented with a new problem instance, the task planner takes these constraints as input, in addition to the PDDL domain and new problem descriptions. The planner applies the constraints to block geometrically infeasible task plans. The high-level plan produced is then grounded by a streams-based motion planner to produce an executable continuous motion plan.
  • Figure 3: VIZ-COAST's Visual Reasoning Module infers constraints through a 4-step prompting procedure. First, it is asked to provide an interpretation of the scene based on an image of the example problem instance and the PDDL domain description. Then, it is asked to articulate in natural language the necessary constraints. After that, the VLM is asked to formally encode the constraints it identified in Python using Z3's API, referencing a structural example. Finally, it proofreads the script to ensure syntactic compliance.
  • Figure 4: Paraphrased example of a constraints file produced by the Visual Reasoning Module for the Containers domain. The VRM produces a function which calls a high-level API to block action assignments under certain conditions, constraining the search space of the task planner. This code prevents any pick or place action whose target is a closed container.
  • Figure 5: The Block domain (left) requires the robot to rearrange blocks to move the red block to the center tile. The end-effector may only approach the grid from a single direction. The Containers domain (right) requires the robot to place the items in target containers, removing and replacing their lids as needed.
  • ...and 2 more figures
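
The Figure 4 caption describes a generated function that calls a high-level API to block pick or place actions whose target is a closed container. The following is a minimal sketch of that idea, not the authors' constraints file. The real system encodes such rules as Z3 SMT constraints for the task planner, whereas this sketch swaps in a plain Python predicate filter so it runs without a solver. Every name here (`ToyPlannerAPI`, `block_action_if`, `add_constraints`, the `closed(...)` state keys) is hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

# An action is (action name, target object), e.g. ("place", "box1").
Action = Tuple[str, str]
# A symbolic state maps ground predicates such as "closed(box1)" to truth values.
State = Dict[str, bool]
Rule = Callable[[Action, State], bool]


@dataclass
class ToyPlannerAPI:
    """Hypothetical stand-in for the planner's high-level constraint API."""

    rules: List[Rule] = field(default_factory=list)

    def block_action_if(self, rule: Rule) -> None:
        # Register a condition under which an action assignment is forbidden.
        self.rules.append(rule)

    def allowed(self, action: Action, state: State) -> bool:
        # An action is allowed only if no registered rule blocks it.
        return not any(rule(action, state) for rule in self.rules)


def add_constraints(api: ToyPlannerAPI, containers: Set[str]) -> None:
    """Block any pick or place whose target is a closed container."""

    def targets_closed_container(action: Action, state: State) -> bool:
        name, target = action
        is_pick_or_place = name in ("pick", "place")
        is_closed = state.get(f"closed({target})", False)
        return is_pick_or_place and target in containers and is_closed

    api.block_action_if(targets_closed_container)


# Usage: box1 has its lid on, box2 is open.
api = ToyPlannerAPI()
add_constraints(api, containers={"box1", "box2"})
state = {"closed(box1)": True, "closed(box2)": False}

place_closed = api.allowed(("place", "box1"), state)     # blocked
place_open = api.allowed(("place", "box2"), state)       # permitted
remove_lid = api.allowed(("remove_lid", "box1"), state)  # permitted
```

Because the rule is registered before any search begins, a planner consulting `allowed` never proposes placing an item into a closed container and must remove the lid first. This mirrors the a priori pruning the abstract describes, as opposed to adding the constraint only after a motion-level refinement failure.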