Touch and Go: Learning from Human-Collected Vision and Touch

Fengyu Yang; Chenyang Ma; Jiacheng Zhang; Jing Zhu; Wenzhen Yuan; Andrew Owens

TL;DR

Touch and Go is a large-scale, human-collected visuo-tactile dataset captured in natural environments with a GelSight tactile sensor and paired with egocentric video. The paper demonstrates three applications: self-supervised visuo-tactile representation learning, tactile-driven image stylization, and multimodal future touch prediction, reporting gains over baselines and highlighting the value of in-the-wild data for cross-modal understanding. In contrast to simulated and robot-centric datasets, the work emphasizes diversity in materials and scenes to enable more generalizable visuo-tactile models. Contributions include a data-collection protocol, an annotation pipeline, a dataset analysis, and demonstrations across perception, synthesis, and prediction tasks, with potential impact on manipulation and material understanding in real-world settings.

Abstract

The ability to associate touch with sight is essential for tasks that require physically interacting with objects in the world. We propose a dataset with paired visual and tactile data called Touch and Go, in which human data collectors probe objects in natural environments using tactile sensors, while simultaneously recording egocentric video. In contrast to previous efforts, which have largely been confined to lab settings or simulated environments, our dataset spans a large number of "in the wild" objects and scenes. To demonstrate our dataset's effectiveness, we successfully apply it to a variety of tasks: 1) self-supervised visuo-tactile feature learning, 2) tactile-driven image stylization, i.e., making the visual appearance of an object more consistent with a given tactile signal, and 3) predicting future frames of a tactile signal from visuo-tactile inputs.

Paper Structure (58 sections, 6 equations, 7 figures, 5 tables)

Introduction
Related Work
Simulated vision and touch.
Robotic vision and touch.
Human-collected multimodal data.
Multimodal feature learning.
Multimodal image prediction.
Cross-modal image stylization.
The Touch and Go Dataset
Collecting a natural visuo-tactile dataset
Capturing procedure.
Hardware.
Annotating the dataset
Detecting the press.
Labeling materials.
...and 43 more sections

Figures (7)

Figure 1: The Touch and Go dataset. We collect a dataset of real-world visual and touch data. (a) Humans walk through a large number of scenes, probing objects around them with a touch sensor and recording video. We apply this dataset to: (b) learning tactile features through self-supervision by associating touch with sight, (c) manipulating an image to match the tactile signal (e.g., restyling a smooth surface to match the tactile signal for a rough rock, whose photo we show for reference), (d) predicting future tactile signals from visuo-tactile inputs.
Figure 2: The Touch and Go Dataset. Human data collectors record paired visual and tactile information by probing objects in a variety of indoor and outdoor spaces. We show a selection of images, paired with the corresponding frame recorded by the GelSight tactile sensor. We show 16 representative categories (out of 20), and provide the distribution of material and scene types.
Figure 3: Visuo-tactile data from other datasets. We provide qualitative examples of visual and tactile data from other datasets (left), along with examples of similar materials taken from our dataset (right).
Figure 4: Qualitative results of our model on tactile-driven image stylization. For each row, we show an input image (left) and the manipulated image (to its right) obtained by stylizing with a given tactile input (right side). For reference, we also show, at the far right, the image that corresponds to the tactile example (not used by the model). The manipulated images convey physical properties of the tactile signal, such as its roughness (e.g., first three rows) or smoothness (e.g., row 10). Other inputs result in images that combine the properties of two inputs (e.g., by adding grass, as in row 8). We also show failure cases in the last row. Zoom in for a better view.
Figure 5: Future touch prediction. We show the results of tactile-only and visuo-tactile models.
...and 2 more figures
