ViTaMIn: Learning Contact-Rich Tasks Through Robot-Free Visuo-Tactile Manipulation Interface
Fangchen Liu, Chuanyu Li, Yihua Qin, Jing Xu, Pieter Abbeel, Rui Chen
TL;DR
ViTaMIn introduces a portable visuo-tactile manipulation interface that enables data collection without robot teleoperation by integrating a compliant Fin Ray gripper with tactile sensing and an end-effector camera. A multimodal pre-training pipeline learns tactile representations through masked autoencoding and contrastive alignment with visual features in CLIP space, improving data efficiency and policy robustness. Across seven real-world contact-rich tasks, ViTaMIn outperforms vision-only baselines and generalizes well across varied objects and lighting, with ablations confirming the value of pre-training. By enriching demonstrations with tactile information, the approach makes robot data collection more scalable and flexible and improves imitation-learning performance for dexterous manipulation.
Abstract
Tactile information plays a crucial role in how humans and robots interact with their environment, particularly in tasks that require understanding contact properties. Solving such dexterous manipulation tasks often relies on imitation learning from demonstration datasets, which are typically collected via teleoperation systems and demand substantial time and effort. To address these challenges, we present ViTaMIn, an embodiment-free manipulation interface that integrates visual and tactile sensing into a hand-held gripper, enabling data collection without teleoperation. Our design employs a compliant Fin Ray gripper with tactile sensing, allowing operators to perceive force feedback during manipulation for more intuitive operation. Additionally, we propose a multimodal representation learning strategy to obtain pre-trained tactile representations, improving data efficiency and policy robustness. Experiments on seven contact-rich manipulation tasks show that ViTaMIn significantly outperforms baseline methods, demonstrating its effectiveness for complex manipulation.
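The contrastive alignment between tactile and visual embeddings mentioned in the TL;DR can be illustrated with a small sketch. This is not the authors' implementation: it assumes a symmetric InfoNCE (CLIP-style) objective, a temperature of 0.07, and toy embedding vectors standing in for encoder outputs. The function names are hypothetical, and embeddings are assumed to be non-zero.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def contrastive_alignment_loss(tactile, visual, temperature=0.07):
    """Symmetric InfoNCE loss over paired tactile/visual embeddings.

    tactile[i] and visual[i] are assumed to come from the same time step;
    every other pairing in the batch is treated as a negative.
    The temperature value is an assumption, not taken from the paper.
    """
    t = [l2_normalize(v) for v in tactile]
    s = [l2_normalize(v) for v in visual]
    # Cosine similarity between every tactile/visual pair, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(ti, sj)) / temperature for sj in s]
              for ti in t]
    logits_t = [list(col) for col in zip(*logits)]  # visual-to-tactile direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy usage: matched pairs yield a much lower loss than mismatched pairs.
tactile_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_alignment_loss(tactile_batch, [[2.0, 0.0], [0.0, 3.0]])
shuffled_loss = contrastive_alignment_loss(tactile_batch, [[0.0, 3.0], [2.0, 0.0]])
```

Minimizing a loss of this kind pulls each tactile embedding toward the visual embedding recorded at the same moment and pushes it away from the others in the batch, which is one common way to place a new modality in an existing embedding space.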
