Rethinking Intermediate Representation for VLM-based Robot Manipulation

Weiliang Tang, Jialin Gao, Jia-Hui Pan, Gang Wang, Li Erran Li, Yunhui Liu, Mingyu Ding, Pheng-Ann Heng, Chi-Wing Fu

TL;DR

SEAM addresses a core gap in VLM-based robot manipulation by splitting the intermediate representation into a semantically rich vocabulary and a CFG-style grammar, enabling both VLM comprehension and task generalization. The approach is complemented by a retrieval-augmented few-shot segmentation pipeline for fine-grained object parts and by two metrics, Action-Generalizability and VLM-Comprehensibility, that quantify scalability and understanding. Empirical results show SEAM outperforms baselines by about 15% in real-world tasks and handles diverse and unseen manipulations robustly. The work thus provides a practical, scalable bridge between language-guided reasoning and actionable robotic control, with fast segmentation inference.

Abstract

Vision-Language Models (VLMs) are an important component for enabling robust robot manipulation. Yet, using them to translate human instructions into an action-resolvable intermediate representation often requires a tradeoff between VLM-comprehensibility and generalizability. Inspired by context-free grammar, we design the Semantic Assembly representation, named SEAM, by decomposing the intermediate representation into vocabulary and grammar. Doing so leads us to a concise vocabulary of semantically rich operations and a VLM-friendly grammar for handling diverse unseen tasks. In addition, we design a new open-vocabulary segmentation paradigm with a retrieval-augmented few-shot learning strategy to localize fine-grained object parts for manipulation, achieving the shortest inference time among all state-of-the-art concurrent works. We also formulate new metrics for action-generalizability and VLM-comprehensibility, demonstrating the compelling performance of SEAM over mainstream representations on both aspects. Extensive real-world experiments further confirm its SOTA performance under varying settings and tasks.

Paper Structure

This paper contains 30 sections, 3 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Comparing (a) high-level, (c) low-level, and (b) our SEAM (Semantic Assembly) representation for supporting robot manipulation. High-level representation requires manually adding new vocabulary words to customize the model for new tasks, despite its VLM-comprehensibility, whereas low-level representation requires generating complex constraints in task handling, despite its generalizability in robot actions. Our new SEAM design meets the goals of both VLM-comprehensibility and action-generalizability.
  • Figure 1: The sequences of execution for the eight real-world tasks.
  • Figure 2: Overall pipeline of our method. Given the current observation and the task instruction, our method first generates (a) the Semantic Assembly Representation (SEAM) with the designed vocabulary and grammar, which is then (b) translated into an intermediate representation. Next, we retrieve the corresponding support images and support masks from (c) the Retrieval Augmented Generation (RAG) Database and (d) segment the target object parts in the scene. Finally, we solve the gripper's trajectories for (e) robotic execution.
  • Figure 2: The initial scene settings.
  • Figure 3: Qualitative performance comparisons for open-vocabulary segmentation between state-of-the-art methods and our method on common manipulation.
  • ...and 8 more figures