Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset

Ning Cheng; You Li; Jing Gao; Bin Fang; Jinan Xu; Wenjuan Han

TL;DR

This work addresses the lack of sentence-level language data in tactile multimodal perception by introducing the TLV dataset, which couples touch, language, and vision through a three-stage annotation pipeline. It then presents STLV-Align, an unsupervised, lightweight training framework that uses LoRA fine-tuning and OpenCLIP encoders to map all three modalities into a shared embedding space, with a frozen text encoder and symmetric contrastive losses. The TLV dataset comprises 19,834 annotated entries (9,834 with touch and 10,000 no-touch) derived from 20,000 VisGel-based pairs. STLV-Align shows substantial improvements on cross-domain tactile classification tasks covering material, hard/soft, and rough/smooth attributes, while updating only 1% of the parameters. The work advances tactile perception by enabling richer cross-modal alignment and points to practical benefits for robotics and human-robot interaction, while leaving room for further performance gains and application to a broader range of tasks.
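
The symmetric contrastive loss mentioned above can be illustrated with a minimal NumPy sketch. This is a generic CLIP-style formulation, not the authors' implementation: the function names, the temperature value of 0.07, and the equal weighting of the two directions are assumptions made for illustration, and the summary does not state which modality pairs are contrasted or how the per-pair losses are combined.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(a, b, temperature=0.07):
    # a, b: (batch, dim) embeddings from two modalities; row i of a is paired with row i of b.
    # temperature=0.07 is an assumed value, not taken from the paper.
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_a2b = -log_softmax(logits, axis=1)[idx, idx].mean()  # match each a to its b
    loss_b2a = -log_softmax(logits, axis=0)[idx, idx].mean()  # match each b to its a
    return 0.5 * (loss_a2b + loss_b2a)

# Hypothetical stand-ins for encoder outputs (for example, touch and text embeddings).
rng = np.random.default_rng(0)
touch_emb = rng.normal(size=(8, 16))
text_emb = touch_emb + 0.01 * rng.normal(size=(8, 16))
print(symmetric_contrastive_loss(touch_emb, text_emb))
```

Averaging the two directions makes the loss invariant to which modality is treated as the query. With a frozen text encoder, only the other branches would receive gradient updates from such a loss.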

Abstract

Tactility provides crucial support and enhancement for the perception and interaction capabilities of both humans and robots. Nevertheless, multimodal research on touch focuses primarily on the visual and tactile modalities, with limited exploration of language. Compared with word-level vocabulary, sentence-level descriptions carry richer semantics. Based on this, we construct a touch-language-vision dataset named TLV (Touch-Language-Vision) through human-machine cascade collaboration, featuring sentence-level descriptions for multimodal alignment. We use the new dataset to fine-tune our proposed lightweight training framework, STLV-Align (Synergistic Touch-Language-Vision Alignment), which achieves effective semantic alignment while adjusting only 1% of the parameters. Project Page: https://xiaoen0.github.io/touch.page/.

Paper Structure (19 sections, 2 equations, 2 figures, 3 tables)

Introduction
Related Work
Tactile Perception
Tactile Datasets
Multimodal Alignment
TLV Dataset
Stage I: Touch and Vision Collection
Stage II: Touch Localization
Stage III: Tactile Labeling
Dataset Statistics
Method
Multi-modal Encoders
LoRA Fine-tuning
Joint Training
Experiments
...and 4 more sections

Figures (2)

Figure 1: Construction process of the TLV dataset.
Figure 2: Overview of our lightweight joint training method.
