Bridging VLM and KMP: Enabling Fine-grained robotic manipulation via Semantic Keypoints Representation

Junjie Zhu, Huayu Liu, Jin Wang, Bangrong Wen, Kaixiang Huang, Xiaofei Li, Haiyun Zhan, Guodong Lu

TL;DR

This work tackles the problem of achieving fine-grained robotic manipulation under ambiguity by bridging Vision-Language Models with Kernelized Movement Primitives. It introduces VL-MP, a framework that uses a Bridge Layer to convert VLM decision outputs into 3D semantic keypoints and a Local Feature Enhanced KMP (LFE-KMP) to preserve trajectory shapes during generalization. Through real-world pouring tasks and shape-preservation benchmarks, VL-MP demonstrates superior task parameter transfer and trajectory fidelity compared to baselines, enabling robust one-shot generalization in complex environments. The approach advances open-set decision-making in robotics by integrating high-level semantic reasoning with precise low-level motion generalization, with potential for extended whole-arm planning in the future.

Abstract

From early Movement Primitive (MP) techniques to modern Vision-Language Models (VLMs), autonomous manipulation has remained a pivotal topic in robotics. The two approaches sit at opposite extremes: VLM-based methods emphasize zero-shot and adaptive manipulation but struggle with fine-grained planning, whereas MP-based approaches excel at precise trajectory generalization but lack decision-making ability. To leverage the strengths of both frameworks, we propose VL-MP, which integrates a VLM with Kernelized Movement Primitives (KMP) via a low-distortion decision information transfer bridge, enabling fine-grained robotic manipulation in ambiguous situations. A key element of VL-MP is the accurate representation of task decision parameters through semantic keypoint constraints, leading to more precise task parameter generation. Additionally, we introduce a local trajectory feature-enhanced KMP to support VL-MP, thereby achieving shape preservation for complex trajectories. Extensive experiments conducted in complex real-world environments validate the effectiveness of VL-MP for adaptive and fine-grained manipulation.

Paper Structure

This paper contains 13 sections, 10 equations, 11 figures, 2 tables, and 1 algorithm.

Figures (11)

  • Figure 1: Illustration of the dilemma in current robotic manipulation: without precise guidance from VLM decision information, KMP methods struggle to generalize to fine-grained tasks, such as pouring water into a specified cup.
  • Figure 2: Overview of VL-MP. In an ambiguous task, we first employ the VLM to process environmental and language inputs for task decision-making. Subsequently, the proposed Bridge Layer extracts 3D keypoints and provides an accurate representation of the task parameters, which are then transmitted downward without distortion. Finally, at the LFE-KMP stage, an accurate generalization of one-shot tasks is achieved through KMP modeling by resampling the local features of the demonstration trajectory.
  • Figure 3: Illustration of $[DemoTraj.]$ extraction. Keypoints are extracted from the skill video stream. In each frame, a local coordinate system is constructed from the keypoints of the manipulated object to capture its pose. The detection result from the last frame is used as the interaction state of the skill.
  • Figure 4: Diagram of the keypoint detection network. The network uses HRNet for multi-scale feature extraction from images. Self-attention is then applied to strengthen the global dependencies of the features. Finally, three output heads predict the center points, keypoints, and depth.
  • Figure 5: Illustration of the keypoint definition and task constraint. (a) Describes the definition of 3D keypoints. (b) Demonstrates keypoint normalization and calibration across instances within the same class, aimed at a more accurate task parameter representation. (c) Describes the construction of the termination task posture, used for subsequent sampling and estimation of the termination task pose. (d), (e) Illustrate the derivation of the normalized calibration interaction keypoints through the x-y plane projection of the keypoints.
  • ...and 6 more figures
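
The Figure 3 caption mentions building a local coordinate system from an object's keypoints in each frame to capture its pose. The summary does not say how the frame is built, so the following is only a minimal sketch under assumptions: three non-collinear 3D keypoints are available per object, and the axes are obtained by Gram-Schmidt orthogonalization. The function names (`local_frame`, `to_local`) and the roles of the keypoints (`origin`, `x_hint`, `plane_hint`) are hypothetical and are not taken from the paper.

```python
import math

def sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    """Dot product of two 3D vectors."""
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    """Cross product a x b of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    """Scale v to unit length; reject near-zero vectors."""
    n = math.sqrt(dot(v, v))
    if n < 1e-9:
        raise ValueError("degenerate keypoints: cannot build a frame")
    return tuple(x / n for x in v)

def local_frame(origin, x_hint, plane_hint):
    """Build a right-handed orthonormal frame from three keypoints.

    Assumption (not from the paper): `origin` anchors the frame,
    `x_hint` fixes the x-axis direction, and `plane_hint` fixes the
    x-y plane. Returns (origin, (x_axis, y_axis, z_axis)).
    """
    x_axis = normalize(sub(x_hint, origin))
    v = sub(plane_hint, origin)
    # Gram-Schmidt step: remove the component of v along x_axis.
    y_raw = sub(v, tuple(dot(v, x_axis) * c for c in x_axis))
    y_axis = normalize(y_raw)
    z_axis = cross(x_axis, y_axis)  # completes the right-handed frame
    return origin, (x_axis, y_axis, z_axis)

def to_local(point, origin, axes):
    """Express a world-frame point in the local object frame."""
    d = sub(point, origin)
    return tuple(dot(d, a) for a in axes)

# Usage: three made-up keypoints in the camera/world frame.
origin, axes = local_frame((1.0, 2.0, 3.0), (3.0, 2.0, 3.0), (1.0, 5.0, 3.0))
local_point = to_local((2.0, 3.0, 4.0), origin, axes)
```

Applying such a construction frame by frame would give a sequence of object poses, which is one plausible way to represent a demonstration trajectory relative to the manipulated object. Collinear keypoints raise an error here rather than silently producing an ill-defined frame.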