Grounded Vision-Language Interpreter for Integrated Task and Motion Planning

Jeremy Siburian, Keisuke Shirai, Cristian C. Beltran-Hernandez, Masashi Hamaya, Michael Görner, Atsushi Hashimoto

TL;DR

ViLaIn-TAMP addresses safety, interpretability, and robustness in vision-language guided planning for long-horizon manipulation. It combines ViLaIn-based PDDL problem generation, a TAMP pipeline that couples symbolic planning with geometric grounding via MoveIt Task Constructor, and a corrective planning (CP) loop that uses grounded failure feedback to revise the problem and replan, up to a fixed maximum number of attempts. In cooking-domain tasks, it outperforms a VLM-as-a-planner baseline by 18% in mean success rate, and adding the CP module boosts mean success rate by 32%; the framework is also validated on real robots. This work advances interpretable, verifiable robot planning by integrating symbolic verification, grounded failure analysis, and real-world execution into a unified framework.

Abstract

While recent advances in vision-language models have accelerated the development of language-guided robot planners, their black-box nature means they often lack the safety guarantees and interpretability crucial for real-world deployment. Conversely, classical symbolic planners offer rigorous safety verification but require significant expert knowledge for setup. To bridge this gap, this paper proposes ViLaIn-TAMP, a hybrid planning framework for enabling verifiable, interpretable, and autonomous robot behaviors. ViLaIn-TAMP comprises three main components: (1) a Vision-Language Interpreter (ViLaIn), adapted from previous work, that converts multimodal inputs into structured problem specifications, (2) a modular Task and Motion Planning (TAMP) system that grounds these specifications in actionable trajectory sequences through symbolic and geometric constraint reasoning, and (3) a corrective planning (CP) module that receives concrete feedback on failed solution attempts and feeds that feedback, together with constraints, back to ViLaIn to refine the specification. We design challenging manipulation tasks in a cooking domain and evaluate our framework. Experimental results demonstrate that ViLaIn-TAMP outperforms a VLM-as-a-planner baseline by 18% in mean success rate, and that adding the CP module boosts mean success rate by 32%.
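
The interaction between the three components can be illustrated with a minimal Python sketch. This is not the authors' implementation: `PlanResult`, `plan_with_correction`, `generate_problem`, `solve`, and `max_cp_attempts` are hypothetical names chosen for illustration, and the two callables merely stand in for ViLaIn (problem generation and revision) and the TAMP module (symbolic plus motion planning).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class PlanResult:
    """Outcome of one TAMP attempt (hypothetical structure)."""
    success: bool
    plan: Optional[list] = None   # symbolic actions with trajectories, if any
    feedback: str = ""            # failure description in natural language


def plan_with_correction(
    generate_problem: Callable[[str, str, Optional[str], str], str],
    solve: Callable[[str], PlanResult],
    instruction: str,
    observation: str,
    max_cp_attempts: int = 3,
) -> Tuple[PlanResult, int]:
    """Generate a problem, plan, and revise on failure up to a fixed limit.

    generate_problem(instruction, observation, previous_problem, feedback)
    stands in for ViLaIn; solve(problem) stands in for the TAMP module.
    Returns the final result and the number of corrective attempts used.
    """
    # Initial attempt: no previous problem and no feedback yet.
    problem = generate_problem(instruction, observation, None, "")
    result = solve(problem)

    attempts = 0
    while not result.success and attempts < max_cp_attempts:
        attempts += 1
        # Corrective planning: re-prompt with the failure feedback,
        # revise the problem, then replan with the revised problem.
        problem = generate_problem(
            instruction, observation, problem, result.feedback
        )
        result = solve(problem)

    return result, attempts
```

Passing the model and planner in as callables keeps the loop itself independent of any particular VLM or planner backend, which also makes it easy to exercise with stubs.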

Paper Structure

This paper contains 18 sections, 10 figures, 1 table, 1 algorithm.

Figures (10)

  • Figure 1: We develop ViLaIn-TAMP, a novel end-to-end planning framework for long-horizon manipulation that 1) converts multimodal inputs into PDDL problems, 2) finds feasible motion plans via an integrated TAMP system, and 3) reasons over detailed failure feedback to revise and replan using corrective planning. ViLaIn-TAMP is capable of solving real-world, long-horizon bimanual cooking tasks.
  • Figure 2: Overview of the ViLaIn-TAMP framework. In the ViLaIn part (A), given a natural-language instruction and an image as a scene observation, the ViLaIn module generates a complete PDDL problem. In the TAMP part (B), the generated PDDL problem is passed to the integrated TAMP module, which solves the problem for a sequence of symbolic actions and collision-free motion trajectories. If successful, the complete plan is executed on the robot; otherwise, corrective planning is performed, where failures are re-prompted back to ViLaIn for revision and replanning.
  • Figure 3: Overview of Corrective Planning Module. ViLaIn-TAMP implements a 3-step corrective planning (CP) approach, which involves 1) re-prompting the model with the failure feedback to 2) revise the PDDL problem, and then 3) replanning using the revised PDDL problem.
  • Figure 4: Example of Motion Planning Failure Feedback and Visualization in RViz. In MTC, both successful and failed motion plans are published and can be visualized in RViz before actual execution, allowing human introspection. Our custom MTC implementation extracts these failures into natural language for easier VLM reasoning.
  • Figure 5: Comparison of ViLaIn-TAMP and the baseline, evaluating their performance on five cooking tasks with and without corrective planning (CP). The maximum number of CP attempts is set to $3$. ViLaIn-TAMP consistently outperforms the baseline in all tasks. CP is effective in both models, consistently improving the success rate by a large margin.
  • ...and 5 more figures
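
The re-prompting step described in the captions for Figures 3 and 4 (turning planner failures into natural language and sending them back with the previous PDDL problem) could look roughly like the sketch below. Everything here is an assumption made for illustration: `StageFailure`, its fields, `failures_to_feedback`, `build_reprompt`, and the prompt wording are hypothetical and do not reflect the MoveIt Task Constructor API or the authors' actual prompt.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageFailure:
    """One failed planning step (hypothetical fields, not the MTC API)."""
    action: str   # symbolic action the step belongs to
    stage: str    # name of the motion-planning stage that failed
    reason: str   # short failure cause reported by the planner


def failures_to_feedback(failures: List[StageFailure]) -> str:
    """Render structured failures as a numbered natural-language list."""
    if not failures:
        return "No motion planning failures were reported."
    lines = ["Motion planning failed for the following steps:"]
    for i, f in enumerate(failures, start=1):
        lines.append(
            f"{i}. Action '{f.action}' failed at stage '{f.stage}': {f.reason}."
        )
    return "\n".join(lines)


def build_reprompt(previous_problem: str, failures: List[StageFailure]) -> str:
    """Combine the previous PDDL problem and failure feedback into a prompt."""
    return (
        "The following PDDL problem could not be solved:\n"
        f"{previous_problem}\n\n"
        f"{failures_to_feedback(failures)}\n\n"
        "Revise the PDDL problem so that the task becomes feasible."
    )
```

Keeping the failure record structured until the last moment means the same data can drive both the text prompt and any visual inspection of the failed plan.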