DexViTac: Collecting Human Visuo-Tactile-Kinematic Demonstrations for Contact-Rich Dexterous Manipulation

Xitong Chen, Yifeng Pan, Min Li, Xiaotian Ding

Abstract

Large-scale, high-quality multimodal demonstrations are essential for robot learning of contact-rich dexterous manipulation. While human-centric data collection systems lower the barrier to scaling, they struggle to capture the tactile information during physical interactions. Motivated by this, we present DexViTac, a portable, human-centric data collection system tailored for contact-rich dexterous manipulation. The system enables the high-fidelity acquisition of first-person vision, high-density tactile sensing, end-effector poses, and hand kinematics within unstructured, in-the-wild environments. Building upon this hardware, we propose a kinematics-grounded tactile representation learning algorithm that effectively resolves semantic ambiguities within tactile signals. Leveraging the efficiency of DexViTac, we construct a multimodal dataset comprising over 2,400 visuo-tactile-kinematic demonstrations. Experiments demonstrate that DexViTac achieves a collection efficiency exceeding 248 demonstrations per hour and remains robust against complex visual occlusions. Real-world deployment confirms that policies trained with the proposed dataset and learning strategy achieve an average success rate exceeding 85% across four challenging tasks. This performance significantly outperforms baseline methods, thereby validating the substantial improvement the system provides for learning contact-rich dexterous manipulation. Project page: https://xitong-c.github.io/DexViTac/.

Paper Structure (15 sections, 7 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: Overview of DexViTac. DexViTac is a portable, human-centric data collection system designed for contact-rich dexterous manipulation. It enables the high-fidelity acquisition of first-person vision, high-density tactile sensing, and hand kinematics within unstructured, in-the-wild environments, facilitating the development of generalized robotic policies across diverse hardware platforms.
  • Figure 2: Hardware design. (a) Equipped with a backpack-integrated Mini-PC and power bank, the proposed system enables out-of-the-box multimodal data collection within in-the-wild environments. (b) The human demonstration interface features a decoupled design comprising a fisheye camera, motion-capture gloves, high-resolution tactile sensors, and a T265 tracking camera. (c) The robot execution platform utilizes an isomorphic perception architecture wherein the tactile sensors remain strictly consistent with those on the human demonstration interface.
  • Figure 3: Data collection pipeline. To prevent frame loss and ensure tight spatiotemporal alignment across different modalities, we employ high-frequency buffering alongside a tactile-anchored synchronization strategy that involves downsampling and nearest-neighbor matching.
  • Figure 4: Two-stage learning strategy. Stage 1: A self-supervised framework aligns high-density tactile features with visual anchors, using a kinematics-grounded encoder to learn spatially anchored representations. Stage 2: The pretrained encoders are then integrated into an Action Chunking with Transformers (ACT) policy that maps synchronized multimodal observations to multi-step action sequences for contact-rich dexterous manipulation.
  • Figure 5: Real-world experimental deployment. The figure illustrates the deployment of the proposed full method across four representative tasks: pipetting, whiteboard erasing, pen insertion, and fruit collection.
  • ...and 4 more figures
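
The tactile-anchored synchronization described in the Figure 3 caption (downsampling plus nearest-neighbor matching) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `nearest_index` and `synchronize`, the `stride` parameter, the choice to downsample the tactile stream itself, the tie-breaking rule, and the example timestamps are all assumptions made for illustration. The sketch only assumes that every stream carries sorted timestamps on a shared clock.

```python
from bisect import bisect_left


def nearest_index(sorted_ts, t):
    """Return the index of the entry in sorted_ts closest to t.

    Ties go to the earlier sample (an arbitrary choice for this sketch).
    """
    if not sorted_ts:
        raise ValueError("stream has no samples")
    pos = bisect_left(sorted_ts, t)
    if pos == 0:
        return 0
    if pos == len(sorted_ts):
        return len(sorted_ts) - 1
    before, after = sorted_ts[pos - 1], sorted_ts[pos]
    return pos - 1 if (t - before) <= (after - t) else pos


def synchronize(tactile_ts, other_streams, stride=1):
    """Align other streams to downsampled tactile timestamps.

    tactile_ts: sorted tactile timestamps (the anchor stream).
    other_streams: dict mapping a stream name to its sorted timestamps.
    stride: keep every stride-th tactile sample (hypothetical downsampling).
    Returns the anchor timestamps and, per stream, the matched sample indices.
    """
    anchors = tactile_ts[::stride]
    matches = {
        name: [nearest_index(ts, t) for t in anchors]
        for name, ts in other_streams.items()
    }
    return anchors, matches


# Hypothetical timestamps in milliseconds, chosen only for this example.
tactile = [0, 10, 20, 30, 40, 50]
camera = [1, 34, 67]
anchors, matches = synchronize(tactile, {"camera": camera}, stride=2)
```

Here each retained tactile timestamp acts as the anchor, and every other modality contributes whichever buffered sample lies closest in time. A practical pipeline would likely also reject matches whose time gap exceeds some tolerance; the caption does not say whether the system does this.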