CordViP: Correspondence-based Visuomotor Policy for Dexterous Manipulation in Real-World

Yankai Fu, Qiuxuan Feng, Ning Chen, Zichen Zhou, Mengzhen Liu, Mingdong Wu, Tianxing Chen, Shanyu Rong, Jiaming Liu, Hao Dong, Shanghang Zhang

TL;DR

The paper tackles the challenge of dexterous real-world manipulation under occlusions and imperfect 3D perception by introducing CordViP, a correspondence-based visuomotor policy. It constructs interaction-aware 3D observations from robust 6D object pose estimates and robot proprioception, pretrains an encoder with object-centric contact maps and hand-arm coordination, and then trains a diffusion-based policy conditioned on these features. Across six real-world tasks, CordViP achieves state-of-the-art performance with high sample efficiency and strong generalization to unseen objects, lighting, and viewpoints, while maintaining efficient inference. This work advances practical 3D-based manipulation by emphasizing spatial-temporal correspondences and coordination cues, offering a scalable path toward reliable dexterous control in unstructured environments.

Abstract

Achieving human-level dexterity in robots is a key objective in the field of robotic manipulation. Recent advancements in 3D-based imitation learning have shown promising results, providing an effective pathway to achieve this goal. However, obtaining high-quality 3D representations presents two key problems: (1) the quality of point clouds captured by a single-view camera is significantly affected by factors such as camera resolution, positioning, and occlusions caused by the dexterous hand; (2) the global point clouds lack crucial contact information and spatial correspondences, which are necessary for fine-grained dexterous manipulation tasks. To eliminate these limitations, we propose CordViP, a novel framework that constructs and learns correspondences by leveraging the robust 6D pose estimation of objects and robot proprioception. Specifically, we first introduce the interaction-aware point clouds, which establish correspondences between the object and the hand. These point clouds are then used for our pre-training policy, where we also incorporate object-centric contact maps and hand-arm coordination information, effectively capturing both spatial and temporal dynamics. Our method demonstrates exceptional dexterous manipulation capabilities, achieving state-of-the-art performance in six real-world tasks, surpassing other baselines by a large margin. Experimental results also highlight the superior generalization and robustness of CordViP to different objects, viewpoints, and scenarios. Code and videos are available on https://aureleopku.github.io/CordViP.
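
The abstract mentions object-centric contact maps without defining them in this section. As a rough illustration only, the sketch below uses one common distance-based formulation: each object point gets a score near 1 when a hand point is close and near 0 when the hand is far away. The function name `contact_map`, the exponential falloff, and the `scale` parameter are assumptions made for this example and may differ from the authors' definition.

```python
import numpy as np


def contact_map(object_points, hand_points, scale=0.02):
    """Hypothetical object-centric contact map (not the paper's exact definition).

    object_points: (N, 3) array of object surface points.
    hand_points:   (M, 3) array of hand surface points, in the same frame.
    scale:         assumed distance falloff in metres.
    Returns an (N,) array with values in (0, 1]; 1 means a hand point coincides
    with the object point.
    """
    # Pairwise differences via broadcasting: (N, 1, 3) - (1, M, 3) -> (N, M, 3).
    diffs = object_points[:, None, :] - hand_points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Distance from each object point to its nearest hand point.
    nearest = dists.min(axis=1)
    # Map distance to a soft contact score.
    return np.exp(-nearest / scale)


# Toy data: one object point touched by the hand, one far away.
obj = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
hand = np.array([[0.0, 0.0, 0.0], [0.0, 0.1, 0.0]])
scores = contact_map(obj, hand)
```

A per-point score like this is defined on the object rather than on the camera image, which is what makes it object-centric: it does not change with the viewpoint as long as the object and hand points are expressed in a common frame.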

Paper Structure

This paper contains 23 sections, 7 equations, 16 figures, 11 tables.

Figures (16)

  • Figure 1: We propose CordViP, a correspondence-based visuomotor policy for dexterous manipulation in the real world. (a) Left: We present the interaction-aware point clouds, which demonstrate robustness to different viewpoints while establishing correspondences between the object and the hand. (b) Right: Our method achieves promising results across multiple real-world dexterous manipulation tasks, showcasing exceptional generalization capabilities.
  • Figure 2: Overview of the framework. (a) We first employ TripoSR to generate the initial object point cloud and FoundationPose to estimate the 6D pose of the object. In parallel, the hand point cloud is generated based on the robot's state. They are combined to construct interaction-aware point clouds, which demonstrate robustness to viewpoint variations. (b) During the pre-training phase, the generated point cloud data, combined with the robot's proprioceptive information, is utilized to enhance spatial understanding and interaction modeling. (c) The pre-trained encoder is subsequently integrated into an imitation learning framework to facilitate downstream tasks in dexterous manipulation.
  • Figure 3: Point Clouds Comparison. We present point clouds of two methods under three different viewpoints. Notably, for better visualization, we have applied color information to the point clouds. However, color information is not used in the policy learning.
  • Figure 4: Real robot system. Our system consists of a Leap Hand and a UR5 Arm, with a fixed RealSense L515 camera employed to capture visual observations. The RealSense D435 camera is only used for data collection during teleoperation and is not involved in the policy learning.
  • Figure 5: Visualization of six dexterous manipulation tasks, with the right side showing the end state.
  • ...and 11 more figures
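
The construction described in the Figure 2(a) caption can be sketched as follows: a canonical object point cloud is moved into the scene by the estimated 6D pose and then merged with hand points derived from the robot state. This is a minimal sketch under stated assumptions, not the authors' implementation. The helper names are hypothetical, the pose is assumed to be given as a rotation matrix and translation already expressed in the same frame as the hand points, and the hand points are passed in directly rather than computed by forward kinematics.

```python
import numpy as np


def pose_to_matrix(rotation, translation):
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 rigid transform."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T


def transform_points(T, points):
    """Apply a 4x4 rigid transform to an (N, 3) array of points."""
    return points @ T[:3, :3].T + T[:3, 3]


def build_interaction_aware_cloud(object_points_canonical, T_object, hand_points):
    """Pose the canonical object points and stack them with the hand points.

    Assumes T_object and hand_points share one reference frame (for example the
    robot base); converting a camera-frame pose into that frame is not shown.
    """
    object_points = transform_points(T_object, object_points_canonical)
    return np.vstack([object_points, hand_points])


# Toy example: rotate the object 90 degrees about z, then shift it 1 m along x.
Rz90 = np.array([[0.0, -1.0, 0.0],
                 [1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0]])
T_obj = pose_to_matrix(Rz90, np.array([1.0, 0.0, 0.0]))
obj_canonical = np.array([[1.0, 0.0, 0.0]])
hand = np.array([[0.5, 0.5, 0.5], [0.6, 0.5, 0.5]])
cloud = build_interaction_aware_cloud(obj_canonical, T_obj, hand)
```

Because both parts are generated from a pose estimate and the robot state rather than from raw depth pixels, the merged cloud stays complete even where the hand occludes the object in the camera view, which is consistent with the robustness to viewpoint variations noted in the caption.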