CordViP: Correspondence-based Visuomotor Policy for Dexterous Manipulation in Real-World

Yankai Fu; Qiuxuan Feng; Ning Chen; Zichen Zhou; Mengzhen Liu; Mingdong Wu; Tianxing Chen; Shanyu Rong; Jiaming Liu; Hao Dong; Shanghang Zhang

TL;DR

The paper tackles dexterous real-world manipulation under occlusions and imperfect 3D perception by introducing CordViP, a correspondence-based visuomotor policy. It constructs interaction-aware 3D observations from robust 6D object pose estimates and robot proprioception, pretrains an encoder with object-centric contact maps and hand-arm coordination, and then trains a diffusion-based policy conditioned on these features. Across six real-world tasks, CordViP achieves state-of-the-art performance with high sample efficiency and strong generalization to unseen objects, lighting, and viewpoints, while maintaining efficient inference. By emphasizing spatio-temporal correspondences and coordination cues, this work advances practical 3D-based manipulation and offers a scalable path toward reliable dexterous control in unstructured environments.

Abstract

Achieving human-level dexterity in robots is a key objective in the field of robotic manipulation. Recent advancements in 3D-based imitation learning have shown promising results, providing an effective pathway to achieve this goal. However, obtaining high-quality 3D representations presents two key problems: (1) the quality of point clouds captured by a single-view camera is significantly affected by factors such as camera resolution, positioning, and occlusions caused by the dexterous hand; (2) the global point clouds lack crucial contact information and spatial correspondences, which are necessary for fine-grained dexterous manipulation tasks. To address these limitations, we propose CordViP, a novel framework that constructs and learns correspondences by leveraging robust 6D pose estimation of objects and robot proprioception. Specifically, we first introduce interaction-aware point clouds, which establish correspondences between the object and the hand. These point clouds are then used for pre-training, where we also incorporate object-centric contact maps and hand-arm coordination information, effectively capturing both spatial and temporal dynamics. Our method demonstrates exceptional dexterous manipulation capabilities, achieving state-of-the-art performance in six real-world tasks and surpassing other baselines by a large margin. Experimental results also highlight the superior generalization and robustness of CordViP to different objects, viewpoints, and scenarios. Code and videos are available at https://aureleopku.github.io/CordViP.
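
To make the idea of interaction-aware point clouds and object-centric contact maps more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that a canonical object point cloud and an estimated 6D pose (rotation plus translation) are available, and that hand surface points have already been obtained from proprioception (for example via forward kinematics, which is not shown). The function names, the one-hot source labels, and the exponential distance-to-score mapping with its `scale` parameter are all illustrative assumptions.

```python
import numpy as np


def transform_points(points, rotation, translation):
    """Apply a rigid 6D pose (3x3 rotation, 3-vector translation) to an (N, 3) cloud."""
    return points @ rotation.T + translation


def interaction_aware_cloud(object_points, hand_points):
    """Stack object and hand points, appending a one-hot source label (assumed encoding)."""
    obj = np.hstack([object_points, np.tile([1.0, 0.0], (len(object_points), 1))])
    hand = np.hstack([hand_points, np.tile([0.0, 1.0], (len(hand_points), 1))])
    return np.vstack([obj, hand])


def contact_map(object_points, hand_points, scale=0.02):
    """Score each object point in (0, 1] by its distance d to the nearest hand point.

    The mapping exp(-d / scale) is a hypothetical choice for illustration.
    """
    diffs = object_points[:, None, :] - hand_points[None, :, :]
    nearest = np.linalg.norm(diffs, axis=-1).min(axis=1)
    return np.exp(-nearest / scale)


# Toy example: two object points in the canonical frame, one hand point.
canonical_object = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
rotation = np.eye(3)                       # placeholder pose estimate
translation = np.array([0.0, 0.0, 0.5])
hand_points = np.array([[0.0, 0.0, 0.5]])  # placeholder for forward-kinematics output

object_world = transform_points(canonical_object, rotation, translation)
cloud = interaction_aware_cloud(object_world, hand_points)
contact = contact_map(object_world, hand_points)
```

Because both clouds are generated from pose and joint state rather than from a depth image, this kind of observation does not degrade when the hand occludes the object from the camera, which is the motivation given in the abstract.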
