Vision-Language Interpreter for Robot Task Planning
Keisuke Shirai, Cristian C. Beltran-Hernandez, Masashi Hamaya, Atsushi Hashimoto, Shohei Tanaka, Kento Kawaharazuka, Kazutoshi Tanaka, Yoshitaka Ushiku, Shinsuke Mori
TL;DR
This work introduces ViLaIn, a Vision-Language Interpreter that converts linguistic instructions and scene observations into machine-readable problem descriptions (PDs) $P=(O,I,G)$, consisting of objects, an initial state, and a goal, written in PDDL to drive symbolic planning. This addresses the interpretability gap in language-guided robot planning. ViLaIn combines Grounding-DINO for object detection, BLIP-2 for captioning, and GPT-4 with few-shot prompts to generate $O$, $I$, and $G$. A corrective re-prompting (CR) mechanism and optional Chain-of-Thought (CoT) reasoning refine the PDs based on planner feedback. The authors present the ProDG dataset, which covers the Cooking, Blocksworld, and Hanoi domains, along with four new evaluation metrics: $\text{R}_\text{syntax}$, $\text{R}_\text{plan}$, $\text{R}_\text{part}$, and $\text{R}_\text{all}$. Experimental results show high syntactic accuracy ($>99\%$), strong plan validity in Cooking and Blocksworld ($\geq 94\%$), and weaker performance in Hanoi ($\approx 58\%$), highlighting the value of CR and CoT in improving PD quality. The work is a step toward interpretable, language-guided robot planning, contributing a PD-generation pipeline and a dataset spanning three domains.
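To make the triple $P=(O,I,G)$ concrete, the following is a minimal Python sketch of how objects, an initial state, and a goal might be held in memory and serialized as a PDDL problem. The class name `ProblemDescription`, its fields, and the Blocksworld predicates (`ontable`, `clear`, `handempty`, `on`) are illustrative assumptions, not the authors' code or the exact ProDG domain definitions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ProblemDescription:
    """Hypothetical container for a PD P = (O, I, G)."""
    name: str
    domain: str
    objects: List[Tuple[str, str]]  # O: (object name, type) pairs
    init: List[str]                 # I: ground atoms, e.g. "(ontable a)"
    goal: List[str]                 # G: ground atoms joined by "and"

    def to_pddl(self) -> str:
        """Serialize the triple as a PDDL problem definition."""
        objs = " ".join(f"{n} - {t}" for n, t in self.objects)
        init = " ".join(self.init)
        goal = " ".join(self.goal)
        return (
            f"(define (problem {self.name})\n"
            f"  (:domain {self.domain})\n"
            f"  (:objects {objs})\n"
            f"  (:init {init})\n"
            f"  (:goal (and {goal})))"
        )


# Assumed toy instance: stack block a on block b.
pd = ProblemDescription(
    name="stack-two",
    domain="blocksworld",
    objects=[("a", "block"), ("b", "block")],
    init=["(ontable a)", "(ontable b)", "(clear a)", "(clear b)", "(handempty)"],
    goal=["(on a b)"],
)
pddl_text = pd.to_pddl()
print(pddl_text)
```

In ViLaIn the three parts are produced by the vision and language models rather than written by hand; the sketch only shows the target format that a symbolic planner would consume.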
Abstract
Large language models (LLMs) are accelerating the development of language-guided robot planners. Meanwhile, symbolic planners offer the advantage of interpretability. This paper proposes a new task that bridges these two trends, namely, multimodal planning problem specification. The aim is to generate a problem description (PD), a machine-readable file used by the planners to find a plan. By generating PDs from language instruction and scene observation, we can drive symbolic planners in a language-guided framework. We propose a Vision-Language Interpreter (ViLaIn), a new framework that generates PDs using state-of-the-art LLM and vision-language models. ViLaIn can refine generated PDs via error message feedback from the symbolic planner. Our aim is to answer the question: How accurately can ViLaIn and the symbolic planner generate valid robot plans? To evaluate ViLaIn, we introduce a novel dataset called the problem description generation (ProDG) dataset. The framework is evaluated with four new evaluation metrics. Experimental results show that ViLaIn can generate syntactically correct problems with more than 99% accuracy and valid plans with more than 58% accuracy. Our code and dataset are available at https://github.com/omron-sinicx/ViLaIn.
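The refinement via planner error messages described above can be pictured as a simple loop: generate a PD, run the planner, and, on failure, re-prompt the generator with the error message. The sketch below is a hedged illustration only. The function `refine_with_feedback`, the stand-ins `toy_generate` and `toy_plan`, and the `max_rounds` limit are all hypothetical; a real system would call an LLM and a PDDL planner in their place.

```python
from typing import Callable, Optional, Tuple

# generate(instruction, previous_pd, error_message) -> new PD text
Generator = Callable[[str, Optional[str], Optional[str]], str]
# plan(pd_text) -> (success flag, planner message)
Planner = Callable[[str], Tuple[bool, str]]


def refine_with_feedback(
    generate: Generator,
    plan: Planner,
    instruction: str,
    max_rounds: int = 3,  # assumed limit, not taken from the paper
) -> Tuple[str, bool, int]:
    """Re-prompt the generator with planner errors until a plan is found."""
    pd_text = generate(instruction, None, None)
    ok, message = plan(pd_text)
    rounds = 0
    while not ok and rounds < max_rounds:
        # Feed the failed PD and the planner's message back to the generator.
        pd_text = generate(instruction, pd_text, message)
        ok, message = plan(pd_text)
        rounds += 1
    return pd_text, ok, rounds


def toy_generate(instruction, previous_pd, error):
    """Stand-in for the LLM: the first draft lacks one closing parenthesis."""
    if error is None:
        return "(define (problem p) (:goal (and (on a b)))"
    return previous_pd + ")"


def toy_plan(pd_text):
    """Stand-in for the planner: only checks that parentheses balance."""
    if pd_text.count("(") != pd_text.count(")"):
        return False, "syntax error: unbalanced parentheses"
    return True, "plan found"


final_pd, ok, rounds = refine_with_feedback(
    toy_generate, toy_plan, "Put block a on block b."
)
print(ok, rounds)
```

Capping the number of rounds keeps the loop from running indefinitely when the generator cannot repair the PD; the cap used here is an arbitrary choice for the example.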
