Vision-Language Interpreter for Robot Task Planning

Keisuke Shirai; Cristian C. Beltran-Hernandez; Masashi Hamaya; Atsushi Hashimoto; Shohei Tanaka; Kento Kawaharazuka; Kazutoshi Tanaka; Yoshitaka Ushiku; Shinsuke Mori

TL;DR

This work introduces ViLaIn, a Vision-Language Interpreter that converts a linguistic instruction and a scene observation into a machine-readable problem description (PD) $P=(O,I,G)$ in PDDL, consisting of objects $O$, an initial state $I$, and a goal specification $G$. The PD drives a symbolic planner, addressing interpretability gaps in language-guided robot planning. ViLaIn combines Grounding-DINO for object detection, BLIP-2 for captioning, and GPT-4 with few-shot prompts to generate $O$, $I$, and $G$. A corrective re-prompting (CR) mechanism and optional Chain-of-Thought (CoT) reasoning refine the PD based on planner feedback. The authors present the ProDG dataset, covering the Cooking, Blocksworld, and Hanoi domains, along with four new evaluation metrics: $\text{R}_\text{syntax}$, $\text{R}_\text{plan}$, $\text{R}_\text{part}$, and $\text{R}_\text{all}$. Experimental results show high syntactic accuracy ($>99\%$), strong plan validity in Cooking and Blocksworld (≥94%), and lower plan validity in Hanoi (≈58%), highlighting the value of CR and CoT in improving PD quality. The work is a step toward interpretable, language-guided robotic planning, contributing a PD-generation pipeline and a three-domain dataset.

Abstract

Large language models (LLMs) are accelerating the development of language-guided robot planners. Meanwhile, symbolic planners offer the advantage of interpretability. This paper proposes a new task that bridges these two trends, namely, multimodal planning problem specification. The aim is to generate a problem description (PD), a machine-readable file used by the planners to find a plan. By generating PDs from language instruction and scene observation, we can drive symbolic planners in a language-guided framework. We propose a Vision-Language Interpreter (ViLaIn), a new framework that generates PDs using state-of-the-art LLM and vision-language models. ViLaIn can refine generated PDs via error message feedback from the symbolic planner. Our aim is to answer the question: How accurately can ViLaIn and the symbolic planner generate valid robot plans? To evaluate ViLaIn, we introduce a novel dataset called the problem description generation (ProDG) dataset. The framework is evaluated with four new evaluation metrics. Experimental results show that ViLaIn can generate syntactically correct problems with more than 99% accuracy and valid plans with more than 58% accuracy. Our code and dataset are available at https://github.com/omron-sinicx/ViLaIn.
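To make the notion of a PD concrete, the following minimal sketch assembles a PDDL problem file from three lists: objects, initial-state atoms, and goal atoms. The function name `render_problem`, the domain name `blocksworld`, and the predicate names (`on-table`, `clear`, `on`, `handempty`) are illustrative assumptions; they are not taken from the ViLaIn code or the ProDG dataset.

```python
from typing import Sequence


def render_problem(
    name: str,
    domain: str,
    objects: Sequence[str],
    init: Sequence[str],
    goal: Sequence[str],
) -> str:
    """Assemble a PDDL problem string from objects, initial state, and goal.

    Each entry of `init` and `goal` is an already-parenthesized PDDL atom.
    """
    lines = [
        f"(define (problem {name})",
        f"  (:domain {domain})",
        f"  (:objects {' '.join(objects)})",
        "  (:init",
        *[f"    {atom}" for atom in init],
        "  )",
        "  (:goal (and",
        *[f"    {atom}" for atom in goal],
        "  ))",
        ")",
    ]
    return "\n".join(lines)


# Hypothetical two-block scene: the instruction asks to stack block a on block b.
pd_text = render_problem(
    name="stack-a-on-b",
    domain="blocksworld",
    objects=["a", "b"],
    init=["(on-table a)", "(on-table b)", "(clear a)", "(clear b)", "(handempty)"],
    goal=["(on a b)"],
)
print(pd_text)
```

A symbolic planner would consume a string like `pd_text` together with a separate domain file that declares the predicates and actions. In ViLaIn the three parts are produced by models rather than written by hand; this sketch only shows the target file layout.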

Paper Structure (19 sections, 7 figures, 5 tables)

INTRODUCTION
RELATED WORK
Planning from Natural Language
Symbolic Planning with PDDL
Scene Recognition for Planning Problem Specification
PROBLEM STATEMENT
Vision-Language Interpreter
Object Estimator
Initial State Estimator
Goal Estimator
Corrective Re-Prompting
Dataset
Evaluation Metrics
Experiments
Generation Settings of ViLaIn
...and 4 more sections

Figures (7)

Figure 1: Overview of our approach. The vision-language interpreter (ViLaIn) generates a problem description from a linguistic instruction and scene observation. The symbolic planner finds an optimal plan from the generated problem description.
Figure 2: The open-vocabulary object detector detects objects from the observation. The text query is provided by the domain knowledge. The detected objects are converted into a PDDL format in a rule-based way.
Figure 3: The captioning model generates captions for each object. The LLM generates the PDDL initial state from the bounding boxes and the captions using few-shot prompting.
Figure 4: The LLM directly generates the PDDL goal specification from the instruction and the PDDL objects and initial state using few-shot prompting.
Figure 5: ViLaIn can refine the generated problem description via an error message from the planner.
...and 2 more figures
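
The refinement step described in the Figure 5 caption can be sketched as a simple retry loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate_pd`, `run_planner`, `PlannerResult`, and `max_retries` are hypothetical names, and the two stub functions stand in for the LLM and the symbolic planner.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class PlannerResult:
    plan: Optional[List[str]]  # sequence of actions, or None if planning failed
    error: Optional[str]       # planner error message, if any


def corrective_reprompting(
    generate_pd: Callable[[Optional[str], Optional[str]], str],
    run_planner: Callable[[str], PlannerResult],
    max_retries: int = 3,
) -> Tuple[str, PlannerResult]:
    """Regenerate a PD from planner error messages until a plan is found
    or the retry budget is exhausted."""
    pd_text = generate_pd(None, None)  # first attempt: no previous PD, no feedback
    result = run_planner(pd_text)
    for _ in range(max_retries):
        if result.plan is not None:
            break
        # Feed the failed PD and the error message back to the generator.
        pd_text = generate_pd(pd_text, result.error)
        result = run_planner(pd_text)
    return pd_text, result


# Stub generator (assumption): the first PD misspells a predicate, the retry fixes it.
def fake_generate(prev_pd: Optional[str], error: Optional[str]) -> str:
    return "(on a b)" if error else "(onn a b)"


# Stub planner (assumption): rejects the misspelled predicate, otherwise returns a plan.
def fake_planner(pd_text: str) -> PlannerResult:
    if "(onn" in pd_text:
        return PlannerResult(plan=None, error="undefined predicate: onn")
    return PlannerResult(plan=["(pick-up a)", "(stack a b)"], error=None)


final_pd, final_result = corrective_reprompting(fake_generate, fake_planner)
```

Passing the generator and planner in as callables keeps the loop independent of any particular LLM API or planner binary, which is a design choice of this sketch rather than something the source specifies.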
