Rethinking Intermediate Representation for VLM-based Robot Manipulation

Weiliang Tang; Jialin Gao; Jia-Hui Pan; Gang Wang; Li Erran Li; Yunhui Liu; Mingyu Ding; Pheng-Ann Heng; Chi-Wing Fu

TL;DR

SEAM targets the tradeoff between VLM-comprehensibility and generalizability in VLM-based robot manipulation by splitting the intermediate representation into a semantically rich vocabulary and a grammar inspired by context-free grammar (CFG). It is paired with a retrieval-augmented, few-shot, open-vocabulary segmentation pipeline that localizes fine-grained object parts, and with two new metrics, Action-Generalizability and VLM-Comprehensibility, that quantify how well a representation scales to new actions and how well a VLM understands it. In real-world experiments, SEAM outperforms baselines by about 15% and handles diverse and unseen manipulation tasks. The work thus offers a practical, scalable bridge between language-guided reasoning and executable robot control.

Abstract

Vision-Language Models (VLMs) are an important component for enabling robust robot manipulation. Yet using them to translate human instructions into an action-resolvable intermediate representation often requires a tradeoff between VLM-comprehensibility and generalizability. Inspired by context-free grammar, we design the Semantic Assembly representation, named SEAM, by decomposing the intermediate representation into a vocabulary and a grammar. Doing so leads to a concise vocabulary of semantically rich operations and a VLM-friendly grammar for handling diverse unseen tasks. In addition, we design a new open-vocabulary segmentation paradigm with a retrieval-augmented few-shot learning strategy to localize fine-grained object parts for manipulation, achieving the shortest inference time among state-of-the-art concurrent works. We also formulate new metrics for action-generalizability and VLM-comprehensibility, and show that SEAM compares favorably with mainstream representations on both. Extensive real-world experiments further demonstrate its state-of-the-art performance under varying settings and tasks.
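To make the vocabulary/grammar split concrete, the following is a minimal sketch of how a small operation vocabulary and a recursive, context-free grammar could be combined to validate a VLM-produced plan. Everything in it is an assumption for illustration: the operation names (`grasp`, `move_to`, `release`, `part`, `above`), their arities, the surface syntax, and the helper functions are hypothetical and are not taken from SEAM.

```python
import re

# Hypothetical vocabulary: operation name -> number of arguments.
# These names are illustrative only and are not SEAM's actual operations.
VOCAB = {"grasp": 1, "move_to": 1, "release": 0, "part": 2, "above": 1}

TOKEN = re.compile(r"\s*([A-Za-z_]\w*|[(),;])")
PUNCT = {"(", ")", ",", ";"}


def tokenize(text):
    """Split a plan string into names and punctuation tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def peek(tokens, i):
    """Return the token at index i, or None past the end."""
    return tokens[i] if i < len(tokens) else None


def parse_expr(tokens, i):
    """expr := NAME | NAME '(' [expr (',' expr)*] ')'"""
    name = peek(tokens, i)
    if name is None or name in PUNCT:
        raise ValueError(f"expected a name at token {i}")
    i += 1
    if peek(tokens, i) != "(":
        return name, i  # bare symbol, e.g. an object or part name
    if name not in VOCAB:
        raise ValueError(f"unknown operation {name!r}")
    i += 1  # consume '('
    args = []
    while peek(tokens, i) != ")":
        if args:
            if peek(tokens, i) != ",":
                raise ValueError(f"expected ',' at token {i}")
            i += 1
        arg, i = parse_expr(tokens, i)  # recursion allows nested operations
        args.append(arg)
    i += 1  # consume ')'
    if len(args) != VOCAB[name]:
        raise ValueError(f"{name!r} takes {VOCAB[name]} argument(s), got {len(args)}")
    return (name, args), i


def parse_program(tokens):
    """program := expr (';' expr)*"""
    steps, i = [], 0
    while True:
        step, i = parse_expr(tokens, i)
        steps.append(step)
        if i == len(tokens):
            return steps
        if tokens[i] != ";":
            raise ValueError(f"expected ';' but got {tokens[i]!r}")
        i += 1


plan = parse_program(tokenize("grasp(part(mug, handle)); move_to(above(shelf)); release()"))
```

In this toy setup, the vocabulary stays small while the recursive grammar lets operations nest and chain, so a plan that uses an unknown operation or the wrong number of arguments is rejected before it reaches a controller.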
