Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation

Qingxuan Jia, Guoqin Tang, Zeyuan Huang, Zixuan Hao, Ning Ji, Shihang Yin, Gang Chen

TL;DR

This work tackles the challenge of achieving millimeter-level precision in VLM-guided robotic manipulation by bridging high-level vision-language reasoning with exact spatial control. It introduces a progressive planning algorithm that couples a dual-layer spatial-semantic fusion (2D topology graphs and 3D spatial networks) with a task-memory-driven, adaptive prompting mechanism. The method unfolds in three stages—spatial-semantic mapping, scene understanding, and task-oriented VLM interaction—while maintaining real-time feedback loops and two execution modes (coarse and fine) guided by a proximity threshold. Extensive experiments on complex assembly tasks show substantial improvements over pure VLM and baseline approaches, including high task success and robust grounding, signaling strong practical potential for modular, cloud-to-edge robotic systems with reduced data requirements and improved resilience to cognitive limits.
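
As a rough illustration of the two execution modes mentioned above, the sketch below switches between coarse and fine control by comparing the gripper-to-target distance against a proximity threshold. The function name, the threshold value, and the assumption that the fine mode applies once the gripper is close to the target are all hypothetical; the summary does not specify them.

```python
import math

# Hypothetical threshold in metres; the summary does not give a value.
PROXIMITY_THRESHOLD_M = 0.05


def select_mode(gripper_xyz, target_xyz, threshold=PROXIMITY_THRESHOLD_M):
    """Return 'fine' when the gripper is within `threshold` of the target, else 'coarse'.

    Assumption: fine control is used near the target, coarse control elsewhere.
    """
    distance = math.dist(gripper_xyz, target_xyz)  # Euclidean distance (Python 3.8+)
    return "fine" if distance <= threshold else "coarse"
```

For example, a gripper 30 cm from the target would be routed to the coarse mode under this threshold, while one 1 cm away would be routed to the fine mode.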

Abstract

Vision-Language Models (VLMs) demonstrate remarkable potential in robotic manipulation, yet challenges persist in executing complex fine manipulation tasks with high speed and precision. While excelling at high-level planning, existing VLM methods struggle to guide robots through precise sequences of fine motor actions. To address this limitation, we introduce a progressive VLM planning algorithm that empowers robots to perform fast, precise, and error-correctable fine manipulation. Our method decomposes complex tasks into sub-actions and maintains three key data structures: task memory structure, 2D topology graphs, and 3D spatial networks, achieving high-precision spatial-semantic fusion. These three components collectively accumulate and store critical information throughout task execution, providing rich context for our task-oriented VLM interaction mechanism. This enables VLMs to dynamically adjust guidance based on real-time feedback, generating precise action plans and facilitating step-wise error correction. Experimental validation on complex assembly tasks demonstrates that our algorithm effectively guides robots to rapidly and precisely accomplish fine manipulation in challenging scenarios, significantly advancing robot intelligence for precision tasks.
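
One way to picture the three data structures named in the abstract is the minimal sketch below. It is not the authors' implementation: every class, field, and function name here is a hypothetical stand-in, and the prompt context is reduced to a plain dictionary.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskMemory:
    """Ordered sub-actions plus the outcome of each attempt (hypothetical layout)."""
    sub_actions: list
    history: list = field(default_factory=list)

    def record(self, sub_action: str, success: bool) -> None:
        self.history.append((sub_action, success))

    def next_sub_action(self) -> Optional[str]:
        # First sub-action that has not yet succeeded, or None when all are done.
        done = {name for name, ok in self.history if ok}
        for name in self.sub_actions:
            if name not in done:
                return name
        return None


@dataclass
class TopologyGraph:
    """2D relations between segmented objects, stored as (subject, relation, object)."""
    relations: set = field(default_factory=set)

    def add(self, subject: str, relation: str, obj: str) -> None:
        self.relations.add((subject, relation, obj))


@dataclass
class SpatialNetwork:
    """3D object centroids in metres, with pairwise distance lookup."""
    centroids: dict = field(default_factory=dict)

    def distance(self, a: str, b: str) -> float:
        return math.dist(self.centroids[a], self.centroids[b])


def build_prompt_context(memory, topology, spatial) -> dict:
    """Collect the accumulated state that a VLM prompt could be built from."""
    return {
        "next_sub_action": memory.next_sub_action(),
        "history": list(memory.history),
        "relations": sorted(topology.relations),
        "centroids": dict(spatial.centroids),
    }
```

Keeping the three stores separate mirrors the abstract's description: the memory tracks task progress, the graph holds qualitative 2D relations, and the network holds metric 3D positions, so each can be updated independently after a new observation.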

Paper Structure

This paper contains 55 sections, 23 equations, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: Progressive VLM Planning Algorithm Architecture. Top-Left: Stage 1 - Spatial Representation and Down-sampling. Stage 1 takes RGB-D images and task descriptions as input. It performs an initial VLM interaction for panoptic segmentation and registers the result with depth point clouds to obtain spatial representations. Top-Right: Stage 2 - Spatial Relationship Analysis and Memorization. Stage 2 constructs and maintains the task sequence memory, topology graph, and spatial network from the registered point clouds for spatiotemporal scene understanding. Bottom-Right: Stage 3 - Task-Oriented VLM Interaction. Stage 3 synthesizes VLM prompts from the task memory, topology graph, and spatial network. These prompts guide the VLM to generate ROS functions for robot subtask execution. Bottom-Left: Robot Execution. A real-world image of the robot arm executing a subtask based on the ROS functions. Feedback Loop: After each subtask, the robot re-observes the environment to obtain scene information and task status, providing input for the next iteration in a closed-loop system.
  • Figure 2: Segmentation and Mapping
  • Figure 3: Simulation Space
  • Figure 4: Task Memory
  • Figure 5: Topology Graph
  • ...and 1 more figure
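
The closed loop described in the Figure 1 caption (observe, prompt the VLM, execute, re-observe) can be sketched as below. This is only an illustrative skeleton: `observe`, `plan`, and `execute` are hypothetical callables standing in for the perception stages, the VLM interaction, and the ROS execution layer, and the retry limit is an assumption rather than something the caption states.

```python
def run_closed_loop(sub_tasks, observe, plan, execute, max_retries=2):
    """Run each sub-task in order, re-observing the scene before every attempt.

    observe() -> scene description
    plan(sub_task, scene, log) -> command for the robot (stand-in for the VLM call)
    execute(command) -> True on success, False on failure
    Returns (all_succeeded, log), where log holds (sub_task, attempt, success) tuples.
    """
    log = []
    for sub_task in sub_tasks:
        for attempt in range(max_retries + 1):
            scene = observe()                    # fresh scene state for this attempt
            command = plan(sub_task, scene, log)  # the log gives the planner past outcomes
            ok = execute(command)
            log.append((sub_task, attempt, ok))
            if ok:
                break
        else:
            # Every attempt failed; stop instead of continuing with a broken state.
            return False, log
    return True, log
```

Passing the log back into `plan` is what lets a failed attempt influence the next prompt, which is the step-wise error correction the abstract refers to.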