
TransDex: Pre-training Visuo-Tactile Policy with Point Cloud Reconstruction for Dexterous Manipulation of Transparent Objects

Fengguan Li, Yifan Ma, Chen Qian, Wentao Rao, Weiwei Shang

Abstract

Dexterous manipulation enables complex tasks but suffers from self-occlusion, severe depth noise, and depth information loss when manipulating transparent objects. To address these problems, this paper proposes TransDex, a 3D visuo-tactile fusion motor policy based on point cloud reconstruction pre-training. Specifically, we first propose a self-supervised, Transformer-based point cloud reconstruction pre-training approach. This method accurately recovers the 3D structure of objects from interactive point clouds of dexterous hands, even when random noise and large-scale masking are applied. Building on this, we construct TransDex, in which perceptual encoding adopts a fine-grained hierarchical scheme and multi-round attention mechanisms adaptively fuse features of the robotic arm and dexterous hand to enable differentiated motion prediction. Transparent object manipulation experiments conducted on a real robotic system demonstrate that TransDex outperforms existing baseline methods. Further analysis validates the generalization capabilities of TransDex and the effectiveness of its individual components.
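
As a rough illustration of the input corruption that the pre-training stage relies on (random noise plus large-scale masking), the following is a minimal sketch, not the authors' implementation. The function name `corrupt_point_cloud`, the Gaussian noise model, the per-point random masking, and the values of `noise_std` and `mask_ratio` are all assumptions chosen for illustration; the paper may mask in patches or use different settings.

```python
import numpy as np


def corrupt_point_cloud(points, noise_std=0.005, mask_ratio=0.6, rng=None):
    """Return a noisy, masked copy of an (N, 3) point cloud and the kept indices.

    Hypothetical helper: noise model, masking scheme, and defaults are assumed.
    """
    rng = np.random.default_rng() if rng is None else rng
    points = np.asarray(points, dtype=np.float64)
    n = points.shape[0]

    # Jitter every coordinate with zero-mean Gaussian noise (assumed noise model).
    noisy = points + rng.normal(0.0, noise_std, size=points.shape)

    # Keep a random subset of points; the remaining ones are treated as masked.
    n_keep = max(1, int(round(n * (1.0 - mask_ratio))))
    keep_idx = np.sort(rng.choice(n, size=n_keep, replace=False))
    return noisy[keep_idx], keep_idx


# Usage: corrupt a synthetic cloud; the clean cloud would serve as the
# reconstruction target during self-supervised pre-training.
rng = np.random.default_rng(0)
cloud = rng.uniform(-0.1, 0.1, size=(1024, 3))
corrupted, kept = corrupt_point_cloud(cloud, rng=rng)
```

In a setup like this, an encoder would see only `corrupted`, while the reconstruction objective would compare the decoder output against the clean object points, so the network has to infer the geometry that was hidden or perturbed.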

Paper Structure (32 sections, 11 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: The Overall Framework of the Pre-Training-Based Visual-Tactile Fusion Motor Policy, TransDex. Perceptual information enters the encoder in a fine-grained, hierarchical manner. The hand-object interaction point cloud is processed by a pre-trained encoder, followed by an attention fusion module and two policy heads that perform feature integration and differentiated action prediction.
  • Figure 2: Pre-Training Framework. First, the original hand-object interaction point cloud is perturbed with noise and masked to generate the input data. Then, features are extracted using a pre-encoder and a Transformer encoder. Finally, generated query points and a Transformer decoder are used to reconstruct the point cloud of the object.
  • Figure 3: Robotic System Setup: (1) a 16-DOF dexterous hand, with array tactile sensors on the fingertips and finger pads; (2) a 7-DOF robotic arm; (3) depth cameras; (4) experimental items; (5) a data glove; (6) a motion capture camera.
  • Figure 4: Visualization of Policy Rollouts on Three Transparent Object Manipulation Tasks: pouring, shaking, and rotating. The unseen objects used in testing, as well as the complex backgrounds and lighting conditions, are shown at the bottom.
  • Figure 5: Visualization of Point Cloud Reconstruction Performance in Pre-Training Tasks and Real-World Transfer Reconstruction Outcomes.