Binding Touch to Everything: Learning Unified Multimodal Tactile Representations
Fengyu Yang, Chao Feng, Ziyang Chen, Hyoungseob Park, Daniel Wang, Yiming Dou, Ziyao Zeng, Xien Chen, Rit Gangopadhyay, Andrew Owens, Alex Wong
TL;DR
UniTouch learns a unified tactile representation across diverse vision-based tactile sensors by aligning tactile embeddings with pretrained image–language embeddings through symmetric, in-batch contrastive losses. It introduces per-sensor learnable tokens to capture sensor-specific differences and a targeted in-batch sampling strategy that balances intra- and inter-sensor negatives, enabling joint training on multiple sensors. The resulting model supports a broad set of zero-shot capabilities, including material classification, grasp stability prediction, cross-modal retrieval with touch, touch-conditioned image synthesis, and X-to-touch generation, and it extends to a Touch-LLM when integrated with an open-source large language model. With strong reported performance on both in-domain and out-of-domain sensors and tasks, UniTouch broadens the applicability of tactile sensing within multimodal foundation-model frameworks and reduces the data needed for cross-modal touch understanding.
Abstract
The ability to associate touch with other modalities has huge implications for humans and computational systems. However, multimodal learning with touch remains challenging due to the expensive data collection process and non-standardized sensor outputs. We introduce UniTouch, a unified tactile model for vision-based touch sensors connected to multiple modalities, including vision, language, and sound. We achieve this by aligning our UniTouch embeddings to pretrained image embeddings already associated with a variety of other modalities. We further propose learnable sensor-specific tokens, allowing the model to learn from a set of heterogeneous tactile sensors, all at the same time. UniTouch is capable of conducting various touch sensing tasks in the zero-shot setting, from robot grasping prediction to touch image question answering. To the best of our knowledge, UniTouch is the first to demonstrate such capabilities. Project page: https://cfeng16.github.io/UniTouch/
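To make the alignment idea concrete, the following is a minimal sketch of a symmetric, in-batch contrastive (InfoNCE-style) loss between touch embeddings and paired image embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature value of 0.07, and the pure-Python list representation are all hypothetical choices made for readability, and the sketch omits the sensor-specific tokens and the in-batch sampling strategy.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """One-directional in-batch contrastive loss.

    The i-th key is treated as the positive for the i-th query; every
    other key in the batch acts as a negative. The temperature is an
    assumed value, not one taken from the paper.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity between query i and every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching pair as the target.
        total += log_denom - logits[i]
    return total / len(q)


def symmetric_contrastive_loss(touch_emb, image_emb, temperature=0.07):
    """Sum of the touch-to-image and image-to-touch losses."""
    return (info_nce(touch_emb, image_emb, temperature)
            + info_nce(image_emb, touch_emb, temperature))


if __name__ == "__main__":
    # Toy batch: two touch embeddings and their paired image embeddings.
    touch = [[1.0, 0.0], [0.0, 1.0]]
    image_aligned = [[0.9, 0.1], [0.1, 0.9]]
    image_shuffled = [[0.1, 0.9], [0.9, 0.1]]
    print("aligned:", symmetric_contrastive_loss(touch, image_aligned))
    print("shuffled:", symmetric_contrastive_loss(touch, image_shuffled))
```

In this toy batch, correctly paired embeddings yield a lower loss than mismatched ones, which is the signal that pulls each touch embedding toward its paired image embedding. Because the image embedding space is already associated with other modalities, a touch encoder trained this way can be compared against those modalities without additional paired data.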
