A Touch, Vision, and Language Dataset for Multimodal Alignment

Letian Fu; Gaurav Datta; Huang Huang; William Chung-Ho Panitch; Jaimyn Drake; Joseph Ortiz; Mustafa Mukadam; Mike Lambeta; Roberto Calandra; Ken Goldberg

TL;DR

This work tackles the challenge of integrating tactile sensing into open-vocabulary vision-language models by building the TVL dataset, a corpus of 44K in-the-wild vision-touch pairs whose language labels come from human annotators (10%) and GPT-4V pseudo-labels (90%). It introduces a vision-language-aligned tactile encoder, trained via pairwise contrastive learning across the tactile, vision, and language modalities, and TVL-LLaMA, a generator built on that encoder and fine-tuned to produce tactile descriptions. On a new TVL Benchmark, the approach improves touch-vision-language alignment (+29% classification accuracy) and tactile-language generation quality (+12% vs GPT-4V, +32% vs open-source VLMs), supporting the benefit of adding touch to multimodal understanding. The work advances open-vocabulary tactile understanding with scalable pseudo-labeling and paves the way for more capable, touch-aware embodied AI and robotics systems.

Abstract

Touch is an important sensing modality for humans, but it has not yet been incorporated into a multimodal generative language model. This is partially due to the difficulty of obtaining natural language labels for tactile data and the complexity of aligning tactile readings with both visual observations and language descriptions. As a step towards bridging that gap, this work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%). We use this dataset to train a vision-language-aligned tactile encoder for open-vocabulary classification and a touch-vision-language (TVL) model for text generation using the trained encoder. Results suggest that by incorporating touch, the TVL model improves (+29% classification accuracy) touch-vision-language alignment over existing models trained on any pair of those modalities. Although only a small fraction of the dataset is human-labeled, the TVL model demonstrates improved visual-tactile understanding over GPT-4V (+12%) and open-source vision-language models (+32%) on a new touch-vision understanding benchmark. Code and data: https://tactile-vlm.github.io.

Paper Structure (39 sections, 16 figures, 12 tables)

Introduction
Related Work
Learning Multimodal Encoders
Tactile Perception
Multimodal Alignment in LLMs
Training from Pseudo-labels
TVL Dataset
Data Collection
Cleaning Candidate Tactile Images
Language Labeling
Dataset Statistics
Tactile-Vision-Language Model
Preliminary
Tactile Encoder
Alignment with Language Models
...and 24 more sections

Figures (16)

Figure 1: Can embodied agents integrate touch with vision and language? To the best of our knowledge, this work presents the first open-vocabulary tactile-vision-language dataset, which we use to train 1) a vision-language aligned tactile encoder and 2) a tactile-vision-language model (TVLM) for describing tactile sensations.
Figure 2: (1) We designed a 3D printed data collection device using the DIGIT tactile sensor and a webcam to synchronously collect tactile and vision observations "in-the-wild" (2). (3) We press and slide the device on surfaces and objects for data collection.
Figure 3: The TVL dataset combines two datasets: SSVTP (Kerr et al., 2023; 4,587 image-touch pairs) and HCT (39,154 image-touch pairs), a new dataset we collected such that the visual observation and the tactile input are synchronously captured. We manually label the SSVTP data (examples shown in the first row). For the newly collected dataset, we prompt GPT-4V (see the pseudo-labeling appendix) to label the data (examples shown in rows 2-4). Note that GPT-4V fails to provide correct tactile labels (row 4) when the contact patch is occluded by the sensor, or when there is not sufficient information to estimate the tactile sensation. In total, this results in a dataset containing 43,741 image-touch pairs with open-vocabulary language labels.
Figure 4: Method. (Left) TVL differs from ImageBind (Girdhar et al., 2023), which only considers the loss between the vision modality and every other modality. TVL calculates the loss between every pair of modalities, including that between the new modality (tactile) and language. Empirically, we show that including this loss improves the model's capability to capture tactile semantics. (Right) Following Han et al. (2023), we average the latents from the tactile and vision modalities and finetune the language model.
Figure 5: Left: We measure the cosine similarity between tactile and language on the entire test set containing 402 tactile, image, and language triplets. However, because different tactile observations may have synonymous language descriptions, we update the top-1 and top-5 accuracy calculations to take this into account (see the evaluation metrics section). Right: GPT-4V and TVL-LLaMA generations with scores rated by GPT-4 based on the human labels. GPT-4V may be distracted by objects that are not in contact because it does not take tactile input into account, and we empirically found no improvement from including the tactile observation in its prompt because the observation is out-of-distribution. As TVL-LLaMA is trained on GPT-4V pseudo-labels, it suffers from the same failure mode.
...and 11 more figures
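
The distinction drawn in the Figure 4 caption, between a loss that only pairs vision with each other modality and a loss over every pair of modalities, can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the symmetric InfoNCE form, the temperature of 0.07, equal weighting of the terms, and all function names (`info_nce`, `vision_anchored_loss`, `all_pairs_loss`) are hypothetical choices made for the example.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `a` is the positive match for row i of `b`.

    The temperature value is an assumption made for this sketch.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b retrieval
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a retrieval
    return 0.5 * (loss_ab + loss_ba)

def vision_anchored_loss(tactile, vision, text):
    """Only pairs that include vision, as the caption describes for ImageBind."""
    return info_nce(vision, tactile) + info_nce(vision, text)

def all_pairs_loss(tactile, vision, text):
    """Every pair of modalities, adding the tactile-language term."""
    return vision_anchored_loss(tactile, vision, text) + info_nce(tactile, text)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random stand-ins for a batch of 8 embeddings of width 16 per modality.
    tactile, vision, text = (rng.normal(size=(8, 16)) for _ in range(3))
    print("vision-anchored:", vision_anchored_loss(tactile, vision, text))
    print("all pairs:      ", all_pairs_loss(tactile, vision, text))
```

The only difference between the two objectives is the extra `info_nce(tactile, text)` term, which gives the tactile embedding a direct training signal from language rather than only an indirect one through vision.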
