ViTaMIn: Learning Contact-Rich Tasks Through Robot-Free Visuo-Tactile Manipulation Interface

Fangchen Liu, Chuanyu Li, Yihua Qin, Jing Xu, Pieter Abbeel, Rui Chen

TL;DR

ViTaMIn introduces a portable visuo-tactile manipulation interface that enables data collection without robot teleoperation by integrating a compliant Fin Ray gripper with tactile sensing and an end-effector camera. A multimodal pre-training pipeline learns tactile representations through masked autoencoding and contrastive alignment with images in CLIP space, boosting data efficiency and policy robustness. Across five real-world contact-rich tasks, ViTaMIn outperforms vision-only baselines and generalizes well across varied objects and lighting, with ablations confirming the value of pre-training. By enriching demonstrations with tactile information, the approach makes robot data collection more scalable and flexible and improves imitation-learning performance for dexterous manipulation.
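The contrastive alignment mentioned above can be illustrated with a generic CLIP-style objective. The following is a minimal sketch, not the authors' implementation: it assumes a symmetric InfoNCE loss over a batch of paired tactile and image embeddings, and the function names, the `temperature` value, and the embedding shapes are all hypothetical choices made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(tactile_emb, image_emb, temperature=0.07):
    """Generic CLIP-style loss (assumed form, not taken from the paper).

    tactile_emb, image_emb: arrays of shape (batch, dim); row i of each
    array is assumed to come from the same time step of a demonstration.
    """
    t = l2_normalize(tactile_emb)
    v = l2_normalize(image_emb)
    logits = t @ v.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])        # matching pairs sit on the diagonal
    loss_t2v = -log_softmax(logits, axis=1)[idx, idx].mean()  # tactile -> image
    loss_v2t = -log_softmax(logits, axis=0)[idx, idx].mean()  # image -> tactile
    return 0.5 * (loss_t2v + loss_v2t)
```

In this sketch, correctly paired embeddings yield a low loss, while mismatched pairs (for example, image embeddings shifted by one position in the batch) yield a higher one, which is the signal that pulls the tactile encoder toward the image embedding space.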

Abstract

Tactile information plays a crucial role in enabling humans and robots to interact effectively with their environment, particularly in tasks that require understanding contact properties. Solving such dexterous manipulation tasks often relies on imitation learning from demonstration datasets, which are typically collected via teleoperation systems and often demand substantial time and effort. To address these challenges, we present ViTaMIn, an embodiment-free manipulation interface that integrates visual and tactile sensing into a hand-held gripper, enabling data collection without teleoperation. Our design employs a compliant Fin Ray gripper with tactile sensing, allowing operators to perceive force feedback during manipulation for more intuitive operation. Additionally, we propose a multimodal representation learning strategy to obtain pre-trained tactile representations, improving data efficiency and policy robustness. Experiments on seven contact-rich manipulation tasks show that ViTaMIn significantly outperforms baseline methods, demonstrating its effectiveness for complex manipulation tasks.

Paper Structure

This paper contains 19 sections, 3 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: ViTaMIn's hardware system overview. The handheld device integrates a GoPro camera, two tactile sensors, and a synchronization camera to align visual and tactile information. During data collection, the two tactile sensors and the synchronization camera are connected to the Raspberry Pi in the backbox. The total weight of the gripper is approximately 1960 g. Left: Side view of the ViTaMIn system. Right: Top view of the ViTaMIn system with the backbox cover removed.
  • Figure 2: Illustration of the multimodal contrastive representation pre-training phase. The tactile encoder is trained to capture complementary information that helps predict the missing content of the future image.
  • Figure 3: Hardware setup for policy deployment.
  • Figure 4: We test ViTaMIn on 5 contact-rich manipulation tasks, including precise and dynamic insertion, object hanging with multimodal feedback, and transparent in-hand object manipulation.
  • Figure 5: The robot needs to flip open a switch (fixed to a force gauge) by rotating it 90 degrees. During the rotation, the robot must minimize axial forces to ensure smooth operation.
  • ...and 3 more figures