Binding Touch to Everything: Learning Unified Multimodal Tactile Representations

Fengyu Yang, Chao Feng, Ziyang Chen, Hyoungseob Park, Daniel Wang, Yiming Dou, Ziyao Zeng, Xien Chen, Rit Gangopadhyay, Andrew Owens, Alex Wong

TL;DR

UniTouch learns unified tactile representations across diverse vision-based tactile sensors by aligning tactile embeddings with pretrained image–language embeddings through symmetric, in-batch contrastive losses. It introduces per-sensor learnable tokens to bridge sensor-specific differences and uses a targeted in-batch sampling strategy that balances intra- and inter-sensor negatives for multi-sensor training. The approach supports a range of zero-shot capabilities, including material classification, grasp stability prediction, cross-modal retrieval with touch, touch-conditioned image synthesis, and X-to-touch generation, and it extends to a Touch-LLM by integrating with an open language model. With results reported on both in-domain and out-of-domain sensors and tasks, UniTouch broadens the applicability of tactile sensing in multimodal foundation-model frameworks and reduces data requirements for cross-modal touch understanding.
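The symmetric, in-batch contrastive alignment mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the temperature value of 0.07, and the choice to sum (rather than average) the two directions are assumptions made for illustration. The image embeddings are treated as fixed targets, and every other sample in the batch serves as a negative.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each vector to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(touch_emb, image_emb, temperature=0.07):
    """InfoNCE-style loss in both directions (touch->image and image->touch).

    touch_emb, image_emb: arrays of shape (batch, dim); row i of each is a pair.
    temperature: assumed value, not taken from the paper.
    """
    t = l2_normalize(touch_emb)
    v = l2_normalize(image_emb)
    logits = t @ v.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(t))                  # matching pairs sit on the diagonal
    loss_t2v = -log_softmax(logits, axis=1)[idx, idx].mean()  # touch -> image
    loss_v2t = -log_softmax(logits, axis=0)[idx, idx].mean()  # image -> touch
    return loss_t2v + loss_v2t               # summing is an assumption

# Toy usage: correctly paired embeddings should score a lower loss than mispaired ones.
rng = np.random.default_rng(0)
image = rng.normal(size=(8, 16))
touch = image + 0.01 * rng.normal(size=image.shape)
print(symmetric_contrastive_loss(touch, image) < symmetric_contrastive_loss(touch[::-1], image))
```

In a real training loop the touch encoder would be updated by gradient descent on this loss while the pretrained image encoder stays fixed; the sketch only evaluates the loss value.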

Abstract

The ability to associate touch with other modalities has huge implications for humans and computational systems. However, multimodal learning with touch remains challenging due to the expensive data collection process and non-standardized sensor outputs. We introduce UniTouch, a unified tactile model for vision-based touch sensors connected to multiple modalities, including vision, language, and sound. We achieve this by aligning our UniTouch embeddings to pretrained image embeddings already associated with a variety of other modalities. We further propose learnable sensor-specific tokens, allowing the model to learn from a set of heterogeneous tactile sensors, all at the same time. UniTouch is capable of conducting various touch sensing tasks in the zero-shot setting, from robot grasping prediction to touch image question answering. To the best of our knowledge, UniTouch is the first to demonstrate such capabilities. Project page: https://cfeng16.github.io/UniTouch/
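Because the touch embeddings are aligned to an image embedding space that is already associated with language, a zero-shot classifier can in principle be built by comparing a touch embedding against text embeddings of class prompts. The sketch below assumes this CLIP-style cosine-similarity recipe; the function name, the toy 3-dimensional vectors, and the material labels are hypothetical placeholders, not outputs of the actual UniTouch or text encoders.

```python
import numpy as np

def zero_shot_classify(touch_emb, text_embs, labels):
    """Pick the label whose text embedding is most similar to the touch embedding.

    touch_emb: shape (dim,), embedding of one tactile image.
    text_embs: shape (num_classes, dim), one embedding per class prompt.
    labels:    list of num_classes label strings.
    """
    t = touch_emb / np.linalg.norm(touch_emb)
    c = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = c @ t                      # cosine similarity to each class prompt
    return labels[int(np.argmax(scores))], scores

# Toy stand-ins for encoder outputs (hypothetical values).
labels = ["fabric", "metal", "wood"]
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
touch_emb = np.array([0.1, 0.9, 0.2])
prediction, scores = zero_shot_classify(touch_emb, text_embs, labels)
print(prediction)
```

No tactile labels are needed at inference time in this scheme; the class set is defined entirely by the text prompts.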

Paper Structure (61 sections, 6 equations, 10 figures, 8 tables)

Figures (10)

  • Figure 1: Putting touch "in touch" with other modalities. We show that a variety of tactile sensing tasks, ranging from touch image understanding to image synthesis with touch, can be solved zero-shot by aligning touch to pretrained multimodal models, extending previous work on other modalities [Girdhar2023ImageBindOE]. Our learned model can be applied to various vision-based tactile sensors and simulators (e.g., GelSight, DIGIT, Taxim, and Tacto). For visualization purposes, we show the corresponding visual signal (labeled "reference") for each touch signal, even though it is not used by the model.
  • Figure 2: Tactile images of different sensors and datasets. In contrast to many other modalities, signals from different touch sensing hardware exhibit large amounts of variation.
  • Figure 3: Method overview. We align our touch embedding with a pre-trained image embedding derived from large-scale vision-language data, using sensor-specific tokens for multi-sensor training.
  • Figure 4: Zero-shot image synthesis with touch. (Left) We generate an image of a scene given a tactile signal. (Right) We perform tactile-driven image stylization to manipulate an image to match a given touch signal. We compare our method to the state-of-the-art supervised diffusion method [yang2023generating] trained on Touch and Go. "Reference" denotes the visual images paired with the input touch in the dataset; they are not seen by the model and are shown only for demonstration purposes. See \ref{sec:supp_exp} for more examples.
  • Figure 5: Touch-LLM. Our Touch-LLM can perform a range of tactile question-answering tasks, such as robot grasping stability prediction, contact localization, and touch image captioning. We also show the "reference" visual images paired with the input touch for illustration. See \ref{sec:supp_exp} for more examples.
  • ...and 5 more figures
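
The sensor-specific tokens named in the Figure 3 caption can be pictured as a small lookup table of learnable vectors, one set per sensor, that is combined with the tactile patch tokens before they enter the encoder. The sketch below is a guess at the general shape of such a mechanism: the class name, the choice to prepend the tokens, the token count, and the initialization scale are all assumptions, and training of the tokens is not shown.

```python
import numpy as np

class SensorTokenTable:
    """Hypothetical table holding one set of learnable tokens per tactile sensor."""

    def __init__(self, sensor_names, num_tokens, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.index = {name: i for i, name in enumerate(sensor_names)}
        # Shape (num_sensors, num_tokens, dim); these would be trained jointly
        # with the touch encoder in a real system.
        self.tokens = rng.normal(scale=0.02, size=(len(sensor_names), num_tokens, dim))

    def prepend(self, patch_tokens, sensor_name):
        # patch_tokens: (num_patches, dim) tokens from one tactile image.
        # Prepending (rather than adding or appending) is an assumption.
        sensor_tokens = self.tokens[self.index[sensor_name]]
        return np.concatenate([sensor_tokens, patch_tokens], axis=0)

# Toy usage with sensor names taken from the Figure 1 caption.
table = SensorTokenTable(["GelSight", "DIGIT"], num_tokens=5, dim=32)
patches = np.zeros((49, 32))
sequence = table.prepend(patches, "DIGIT")
print(sequence.shape)  # (54, 32)
```

Keeping the sensor identity in a few extra tokens lets one shared encoder process images from all sensors, with only the token table differing per device.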