Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation

Qingxuan Jia; Guoqin Tang; Zeyuan Huang; Zixuan Hao; Ning Ji; Shihang Yin; Gang Chen

TL;DR

This work targets millimeter-level precision in VLM-guided robotic manipulation by connecting high-level vision-language reasoning to exact spatial control. It introduces a progressive planning algorithm that couples dual-layer spatial-semantic fusion (2D topology graphs and 3D spatial networks) with a task-memory-driven, adaptive prompting mechanism. The method runs in three stages: spatial-semantic mapping, scene understanding, and task-oriented VLM interaction. Throughout, it maintains real-time feedback loops and switches between two execution modes, coarse and fine, according to a proximity threshold. Experiments on complex assembly tasks show improvements over pure-VLM and baseline approaches in task success and grounding robustness, which suggests practical potential for modular, cloud-to-edge robotic systems with reduced data requirements and improved resilience to VLM cognitive limits.

Abstract

Vision-Language Models (VLMs) demonstrate remarkable potential in robotic manipulation, yet challenges persist in executing complex fine manipulation tasks with high speed and precision. While existing VLM methods excel at high-level planning, they struggle to guide robots through precise sequences of fine motor actions. To address this limitation, we introduce a progressive VLM planning algorithm that enables robots to perform fast, precise, and error-correctable fine manipulation. Our method decomposes complex tasks into sub-actions and maintains three key data structures: a task memory structure, 2D topology graphs, and 3D spatial networks, which together achieve high-precision spatial-semantic fusion. These three components accumulate and store critical information throughout task execution, providing rich context for our task-oriented VLM interaction mechanism. This enables VLMs to dynamically adjust guidance based on real-time feedback, generating precise action plans and facilitating step-wise error correction. Experimental validation on complex assembly tasks demonstrates that our algorithm effectively guides robots to rapidly and precisely accomplish fine manipulation in challenging scenarios, advancing robot intelligence for precision tasks.
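
To make the three data structures and the coarse/fine mode switch more concrete, the following is a minimal Python sketch, not the authors' implementation. All class names, field names, the relation vocabulary, and the 0.05 m threshold value are assumptions chosen for illustration; the summary above states only that such structures exist and that a proximity threshold selects the execution mode.

```python
from dataclasses import dataclass, field
import math

# Hypothetical type alias: a 3D point in metres, expressed in the robot frame.
Point3D = tuple[float, float, float]


@dataclass
class TaskMemory:
    """Ordered sub-actions plus a log of outcomes (illustrative layout)."""
    sub_actions: list[str]
    history: list[dict] = field(default_factory=list)

    def record(self, action: str, success: bool, note: str = "") -> None:
        # Append one step outcome so later prompts can refer to it.
        self.history.append({"action": action, "success": success, "note": note})


@dataclass
class TopologyGraph2D:
    """Image-plane relations such as ("peg", "left_of", "hole")."""
    relations: set[tuple[str, str, str]] = field(default_factory=set)


@dataclass
class SpatialNetwork3D:
    """Metric 3D positions of the objects in the scene."""
    positions: dict[str, Point3D] = field(default_factory=dict)

    def distance(self, a: str, b: str) -> float:
        # Euclidean distance between two named objects.
        return math.dist(self.positions[a], self.positions[b])


def select_mode(net: SpatialNetwork3D, tool: str, target: str,
                proximity_threshold: float = 0.05) -> str:
    """Return "fine" when tool and target are closer than the threshold.

    The default of 0.05 m is an assumed value, not one taken from the paper.
    """
    return "fine" if net.distance(tool, target) < proximity_threshold else "coarse"


def build_context(memory: TaskMemory, graph: TopologyGraph2D,
                  net: SpatialNetwork3D) -> str:
    """Flatten the three structures into text that a VLM prompt could include."""
    lines = [f"history: {h['action']} -> {'ok' if h['success'] else 'failed'}"
             for h in memory.history]
    lines += [f"relation: {s} {r} {o}" for s, r, o in sorted(graph.relations)]
    lines += [f"position: {name} at {pos}" for name, pos in sorted(net.positions.items())]
    return "\n".join(lines)
```

In this sketch, a controller would call `select_mode` before each sub-action and pass the output of `build_context` to the VLM, so that the guidance reflects both accumulated task history and the current scene geometry.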
