The Language of Touch: Translating Vibrations into Text with Dual-Branch Learning

Jin Chen, Yifeng Lin, Chao Zeng, Si Wu, Tiesong Zhao

Abstract

The standardization of vibrotactile data by the IEEE P1918.1 working group has greatly advanced its applications in virtual reality, human-computer interaction, and embodied artificial intelligence. Despite these efforts, the semantic interpretation and understanding of vibrotactile signals remain an unresolved challenge. In this paper, we make the first attempt to address vibrotactile captioning, i.e., generating natural language descriptions from vibrotactile signals. We propose Vibrotactile Periodic-Aperiodic Captioning (ViPAC), a method designed to handle the intrinsic properties of vibrotactile data, including hybrid periodic-aperiodic structures and the lack of spatial semantics. Specifically, ViPAC employs a dual-branch strategy to disentangle periodic and aperiodic components, combined with a dynamic fusion mechanism that adaptively integrates signal features. It also introduces an orthogonality constraint and weighting regularization to ensure feature complementarity and fusion consistency. Additionally, we construct LMT108-CAP, the first vibrotactile-text paired dataset, using GPT-4o to generate five constrained captions per surface image from the popular LMT-108 dataset. Experiments show that ViPAC significantly outperforms baseline methods adapted from audio and image captioning, achieving superior lexical fidelity and semantic alignment.
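The dynamic fusion mechanism described above can be illustrated with a minimal numpy sketch: a learned scalar gate (a sigmoid over the concatenated branch features) blends the periodic and aperiodic representations. The function name, parameter shapes, and the scalar-gate formulation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def gated_fusion(f_per, f_aper, w_gate, b_gate):
    """Blend periodic and aperiodic feature vectors with a learned gate.

    Hypothetical sketch: w_gate (shape 2d) and b_gate (scalar) would be
    learned parameters; the sigmoid output plays the role of an estimated
    periodicity score in (0, 1).
    """
    z = np.concatenate([f_per, f_aper], axis=-1) @ w_gate + b_gate
    w = 1.0 / (1.0 + np.exp(-z))          # periodicity score
    return w * f_per + (1.0 - w) * f_aper  # convex combination of branches
```

A saturated gate (score near 1) passes the periodic branch through almost unchanged, while a mid-range score yields an even blend.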

Paper Structure

This paper contains 20 sections, 14 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Application scenarios of vibrotactile captioning.
  • Figure 2: Illustration of the vibrotactile-text dataset generation process. Surface images from the LMT-108 dataset are provided as input to GPT-4o, which generates five textual descriptions per image under predefined linguistic constraints. These descriptions are then paired with the corresponding triaxial acceleration signals collected from the same material surfaces, resulting in the final vibrotactile-text dataset.
  • Figure 3: Examples of material surface images and their corresponding three-axis vibrotactile signals. Top: materials with regular textures exhibit strong periodicity. Bottom: irregular surfaces yield noisy, aperiodic signals. This motivates the use of distinct modeling pathways.
  • Figure 4: ViPAC takes triaxial acceleration signals as input and applies DFT321 to obtain 1D vibration data. These signals are processed by a dual-branch encoder that separately models periodic and aperiodic components using FAN-based frequency analysis and Transformer+LSTM-based temporal modeling, respectively. The extracted features are dynamically fused based on estimated periodicity scores, and the fused representation is decoded into natural language using a Transformer decoder.
  • Figure 5: Qualitative comparisons between ViPAC generated captions and five GPT-4o reference descriptions for four representative materials. Matched phrases are highlighted to emphasize semantic consistency. The selected samples—covering regular perforations, fine grids, rough glitter, and irregular bumps—demonstrate ViPAC’s ability to produce accurate and diverse textual descriptions directly from vibrotactile signals.
  • ...and 1 more figure
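The Figure 4 caption mentions DFT321, the standard reduction of a triaxial acceleration signal to a single 1D vibration signal: the combined magnitude spectrum preserves the total spectral energy of the three axes, and the phase is taken from the summed spectrum. The sketch below, in numpy, is my reading of that standard algorithm and is not taken from the paper's code.

```python
import numpy as np

def dft321(ax, ay, az):
    """Reduce triaxial acceleration to one real 1-D vibration signal.

    Sketch of the DFT321 algorithm: the output magnitude spectrum is
    sqrt(|X|^2 + |Y|^2 + |Z|^2), preserving the combined spectral energy
    of the three axes; the phase is borrowed from the summed spectrum.
    """
    X = np.fft.rfft(ax)
    Y = np.fft.rfft(ay)
    Z = np.fft.rfft(az)
    mag = np.sqrt(np.abs(X) ** 2 + np.abs(Y) ** 2 + np.abs(Z) ** 2)
    phase = np.angle(X + Y + Z)
    # Invert back to a real time-domain signal of the original length.
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(ax))
```

By construction, the magnitude spectrum of the returned signal matches the per-bin energy of the three input axes, which is the property that makes the 1D signal a faithful input for the dual-branch encoder.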