Bridging VLM and KMP: Enabling Fine-grained Robotic Manipulation via Semantic Keypoints Representation
Junjie Zhu, Huayu Liu, Jin Wang, Bangrong Wen, Kaixiang Huang, Xiaofei Li, Haiyun Zhan, Guodong Lu
TL;DR
This work addresses fine-grained robotic manipulation in ambiguous situations by bridging Vision-Language Models (VLMs) with Kernelized Movement Primitives (KMP). It introduces VL-MP, a framework that uses a Bridge Layer to convert VLM decision outputs into 3D semantic keypoints and a Local Feature Enhanced KMP (LFE-KMP) to preserve trajectory shapes during generalization. In real-world pouring tasks and shape-preservation benchmarks, VL-MP transfers task parameters more accurately and preserves trajectory shape better than the baselines, enabling robust one-shot generalization in complex environments. The approach advances open-set decision-making in robotics by combining high-level semantic reasoning with precise low-level motion generalization, and could be extended to whole-arm planning in the future.
Abstract
From early Movement Primitive (MP) techniques to modern Vision-Language Models (VLMs), autonomous manipulation has remained a pivotal topic in robotics. The two families sit at opposite extremes: VLM-based methods emphasize zero-shot and adaptive manipulation but struggle with fine-grained planning, whereas MP-based approaches excel at precise trajectory generalization but lack decision-making ability. To leverage the strengths of both, we propose VL-MP, which integrates a VLM with Kernelized Movement Primitives (KMP) via a low-distortion bridge for transferring decision information, enabling fine-grained robotic manipulation in ambiguous situations. A key element of VL-MP is the accurate representation of task decision parameters through semantic keypoint constraints, which leads to more precise task parameter generation. Additionally, we introduce a local trajectory feature-enhanced KMP to support VL-MP, thereby achieving shape preservation for complex trajectories. Extensive experiments in complex real-world environments validate the effectiveness of VL-MP for adaptive and fine-grained manipulation.
