Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper

Xinyue Zhu, Binghao Huang, Yunzhu Li

TL;DR

This work addresses the scarcity of tactile feedback in handheld robotic data collection by introducing a portable visuo-tactile gripper and a scalable, cross-modal learning framework. A two-stage pipeline first learns a fused visuo-tactile representation via masked tactile reconstruction, then uses that representation to condition a diffusion policy for fine-grained manipulation. The authors collect and publicly release a large in-the-wild dataset with over 2.6 million visuo-tactile pairs spanning 43 tasks, and report improved robustness and sample efficiency on real-world tasks such as test tube insertion and pipette-based fluid transfer. The findings highlight the value of tactile signals for in-hand state estimation, phase transitions, and coordinated vision–touch analysis, with practical implications for deploying robust, contact-rich manipulation in unstructured environments.

Abstract

Handheld grippers are increasingly used to collect human demonstrations due to their ease of deployment and versatility. However, most existing designs lack tactile sensing, despite the critical role of tactile feedback in precise manipulation. We present a portable, lightweight gripper with integrated tactile sensors that enables synchronized collection of visual and tactile data in diverse, real-world, and in-the-wild settings. Building on this hardware, we propose a cross-modal representation learning framework that integrates visual and tactile signals while preserving their distinct characteristics. The learning procedure allows the emergence of interpretable representations that consistently focus on contacting regions relevant for physical interactions. When used for downstream manipulation tasks, these representations enable more efficient and effective policy learning, supporting precise robotic manipulation based on multimodal feedback. We validate our approach on fine-grained tasks such as test tube insertion and pipette-based fluid transfer, demonstrating improved accuracy and robustness under external disturbances. Our project page is available at https://binghao-huang.github.io/touch_in_the_wild/ .
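
The masked tactile reconstruction objective mentioned in the TL;DR can be illustrated with a small sketch. This is not the authors' implementation: the function names, the patch size, the mask ratio, and the use of a plain 2D grid of pressure values are all assumptions made for illustration, and the encoder that would predict the hidden cells from vision plus the visible tactile cells is omitted. The sketch only shows how patches of a tactile map might be hidden and how a reconstruction loss restricted to the hidden cells could be computed.

```python
import random


def mask_tactile_patches(tactile, patch=4, mask_ratio=0.5, seed=0):
    """Hide a random subset of non-overlapping patches in a 2D tactile map.

    `tactile` is a list of rows of floats (hypothetical sensor layout).
    Returns (masked_map, mask), where mask[r][c] is True for hidden cells.
    """
    h, w = len(tactile), len(tactile[0])
    if h % patch or w % patch:
        raise ValueError("tactile map must tile evenly into patches")
    gh, gw = h // patch, w // patch
    n_masked = round(mask_ratio * gh * gw)
    rng = random.Random(seed)
    hidden = set(rng.sample(range(gh * gw), n_masked))
    # Expand the per-patch decision to per-cell resolution.
    mask = [[((r // patch) * gw + (c // patch)) in hidden for c in range(w)]
            for r in range(h)]
    masked = [[0.0 if mask[r][c] else tactile[r][c] for c in range(w)]
              for r in range(h)]
    return masked, mask


def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error computed over hidden cells only."""
    errs = [(p - t) ** 2
            for pred_row, tgt_row, mask_row in zip(pred, target, mask)
            for p, t, m in zip(pred_row, tgt_row, mask_row) if m]
    return sum(errs) / len(errs) if errs else 0.0


# Toy 8x8 tactile map with strictly positive readings.
tactile = [[float(r * 8 + c + 1) for c in range(8)] for r in range(8)]
masked, mask = mask_tactile_patches(tactile, patch=4, mask_ratio=0.5)

# A perfect reconstruction has zero loss; returning the masked input does not.
perfect = masked_reconstruction_loss(tactile, tactile, mask)
trivial = masked_reconstruction_loss(masked, tactile, mask)
```

Restricting the loss to hidden cells means a model cannot score well by copying the visible tactile input, so it has to infer contact from the remaining signals.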

Paper Structure

This paper contains 29 sections, 5 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: (a) Our portable handheld gripper enables synchronized collection of visual and tactile data, supporting large-scale data collection in the wild. (b) We introduce a multimodal representation learning framework that fuses visual and tactile inputs to support fine-grained downstream manipulation tasks.
  • Figure 2: (a) Multimodal data collection in the wild using our portable visuo-tactile system, with example tactile signals from both fingers. (b) Close-up of the handheld gripper, equipped with flexible tactile sensors and a fisheye camera for synchronized visuo-tactile capture. (c) Robotic setup for downstream tasks, featuring an XArm 850 with the same sensor configuration.
  • Figure 3: Method Overview of Our Two-Stage Pipeline. Stage 1: We pretrain a visuo-tactile encoder via cross-modal reconstruction using a large-scale dataset collected across diverse indoor and outdoor environments. Stage 2: The pretrained encoder is combined with robot proprioception to condition a diffusion policy for downstream tasks such as object reorientation and insertion.
  • Figure 4: Pretraining Data Distribution. Our dataset comprises over 2,700 demonstrations, split across three categories: (1) the four core tasks introduced in this paper, (2) other indoor tasks to broaden the data distribution, and (3) in-the-wild tasks collected in diverse outdoor environments. We include representative examples from each category to highlight the variety in both task complexity and environmental context.
  • Figure 5: Qualitative Results of Pretraining. We show four examples illustrating our pretrained encoder’s tactile reconstruction performance and ViT self-attention heatmaps. The encoder accurately reconstructs tactile images for both in- and out-of-distribution inputs, while the vision module consistently attends to the gripper–contact region, independent of background or object familiarity.
  • ...and 6 more figures