Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset
Ning Cheng, You Li, Jing Gao, Bin Fang, Jinan Xu, Wenjuan Han
TL;DR
This work addresses the lack of sentence-level language data in tactile multimodal perception by introducing the TLV dataset, which couples touch, language, and vision through a three-stage annotation pipeline. It then presents STLV-Align, an unsupervised, lightweight training framework that uses LoRA fine-tuning and OpenCLIP encoders to map all three modalities into a shared embedding space, with a frozen text encoder and symmetric contrastive losses. The TLV dataset comprises 19,834 annotated entries (9,834 with touch and 10,000 without touch) derived from 20,000 VisGel-based pairs. STLV-Align shows substantial improvements on cross-domain tactile classification tasks covering material, hard/soft, and rough/smooth attributes, while updating only $1\%$ of the parameters. The work advances tactile perception by enabling richer cross-modal alignment and points to practical benefits for robotics and human-robot interaction, while leaving room for further performance gains and application to a broader range of tasks.
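To make the phrase "symmetric contrastive losses" concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss between two batches of paired embeddings (for example, touch and text). It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature value of 0.07, and the use of NumPy are all hypothetical choices, and the summary does not state which modality pairs are contrasted or how the loss terms are weighted.

```python
import numpy as np


def _log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(a, b, temperature=0.07):
    """CLIP-style symmetric loss for paired embeddings a[i] <-> b[i].

    a, b: arrays of shape (batch, dim). The temperature is an assumed
    value, not one taken from the paper.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(a))
    # Cross-entropy with the matching pair (the diagonal) as the target,
    # computed in both directions: a -> b (rows) and b -> a (columns).
    loss_ab = -_log_softmax(logits, axis=1)[idx, idx].mean()
    loss_ba = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    touch = rng.normal(size=(8, 16))  # stand-in for touch embeddings
    text = rng.normal(size=(8, 16))   # stand-in for frozen text embeddings
    print("unaligned:", symmetric_contrastive_loss(touch, text))
    print("aligned:  ", symmetric_contrastive_loss(touch, touch))
```

Under this formulation, a frozen text encoder would simply mean that gradients flow only through the other modality's encoder, so the text embeddings act as fixed anchors in the shared space.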
Abstract
Tactility provides crucial support and enhancement for the perception and interaction capabilities of both humans and robots. Nevertheless, multimodal research on touch focuses primarily on the visual and tactile modalities, with limited exploration of language. Beyond individual words, sentence-level descriptions contain richer semantics. Based on this, we construct a touch-language-vision dataset named TLV (Touch-Language-Vision) through human-machine cascade collaboration, featuring sentence-level descriptions for multimodal alignment. We use the new dataset to fine-tune our proposed lightweight training framework, STLV-Align (Synergistic Touch-Language-Vision Alignment), achieving effective semantic alignment with minimal parameter adjustments (1%). Project Page: https://xiaoen0.github.io/touch.page/.
