Touch and Go: Learning from Human-Collected Vision and Touch
Fengyu Yang, Chenyang Ma, Jiacheng Zhang, Jing Zhu, Wenzhen Yuan, Andrew Owens
TL;DR
Touch and Go presents a large-scale, human-collected visuo-tactile dataset in which objects in natural environments are probed with a GelSight tactile sensor while egocentric video is recorded. It demonstrates three core applications: self-supervised visuo-tactile representation learning, tactile-driven image stylization, and multimodal future touch prediction, reporting gains over baselines and highlighting the value of in-the-wild data for cross-modal understanding. In contrast to simulated and robot-centric datasets, the work emphasizes diversity in materials and scenes to enable more generalizable visuo-tactile models. The contributions include a detailed data-collection protocol, an annotation pipeline, dataset analysis, and demonstrations across perception, synthesis, and prediction tasks, with potential impact on manipulation and material understanding in real-world settings.
Abstract
The ability to associate touch with sight is essential for tasks that require physically interacting with objects in the world. We propose a dataset with paired visual and tactile data called Touch and Go, in which human data collectors probe objects in natural environments using tactile sensors, while simultaneously recording egocentric video. In contrast to previous efforts, which have largely been confined to lab settings or simulated environments, our dataset spans a large number of "in the wild" objects and scenes. To demonstrate our dataset's effectiveness, we successfully apply it to a variety of tasks: 1) self-supervised visuo-tactile feature learning, 2) tactile-driven image stylization, i.e., making the visual appearance of an object more consistent with a given tactile signal, and 3) predicting future frames of a tactile signal from visuo-tactile inputs.
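To make the first task concrete: self-supervised visuo-tactile feature learning is often posed as a contrastive problem, in which the embedding of a video frame should be closer to the embedding of the touch signal recorded at the same moment than to touch signals from other moments. The abstract does not state the training objective, so the following is only a minimal sketch that assumes a symmetric InfoNCE-style loss. The function names, the temperature value, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def diagonal_log_softmax(logits):
    """For each row i, return log-softmax of that row evaluated at column i."""
    values = []
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        values.append(row[i] - log_sum_exp)
    return values

def info_nce(vision, touch, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    vision[i] and touch[i] are assumed to come from the same moment
    (a positive pair); every other combination acts as a negative.
    The temperature of 0.07 is an illustrative default, not a value
    reported in the source.
    """
    v = [l2_normalize(x) for x in vision]
    t = [l2_normalize(x) for x in touch]
    # Cosine-similarity matrix scaled by the temperature: rows index
    # vision embeddings, columns index touch embeddings.
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    logits_t = [list(col) for col in zip(*logits)]
    vision_to_touch = -sum(diagonal_log_softmax(logits)) / len(v)
    touch_to_vision = -sum(diagonal_log_softmax(logits_t)) / len(t)
    return 0.5 * (vision_to_touch + touch_to_vision)

if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0]]   # toy vision embeddings
    matched = [[1.0, 0.0], [0.0, 1.0]]  # touch embeddings in the same order
    swapped = [[0.0, 1.0], [1.0, 0.0]]  # touch embeddings paired incorrectly
    print("matched pairs:", info_nce(frames, matched))
    print("swapped pairs:", info_nce(frames, swapped))
```

In a real pipeline the embeddings would come from learned image and tactile encoders rather than hand-written vectors. The toy inputs only show that correctly paired embeddings yield a lower loss than mismatched ones.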
