A Case Study on Visual-Audio-Tactile Cross-Modal Retrieval

Jagoda Wojcik; Jiaqi Jiang; Jiacheng Wu; Shan Luo

TL;DR

VAT-CMR addresses cross-modal retrieval across three modalities (vision, audio, and touch) for robotics. It introduces a three-branch network with multi-head attention fusion, trained first with a cross-entropy objective applied to a selected dominant modality and then with a triplet loss that maps features into a common semantic space. The method is evaluated on a synthetic dataset of 20 objects built from ObjectFolder, where it achieves higher mean average precision (MAP) than the canonical correlation analysis (CCA) and ObjectFolder baselines for vision, touch, and audio queries. Ablation studies confirm the contributions of the attention module and dominant-modality training, and visualizations illustrate the progressive organization of the latent representations. Future work includes real-world generalization, additional modalities, and zero-shot retrieval.
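As a rough illustration of the MAP metric mentioned above, the sketch below computes mean average precision for label-based retrieval. It assumes a retrieved item counts as relevant when its object label matches the query's label; the exact evaluation protocol used in the paper is not given here, and the function names `average_precision` and `mean_average_precision` are hypothetical.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance: list of booleans, best-ranked item first,
    True where the retrieved item is relevant to the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(query_labels, ranked_labels_per_query):
    """Mean of per-query average precision.

    Assumption: an item is relevant when its label equals the query label.
    """
    aps = [
        average_precision([label == query for label in ranked])
        for query, ranked in zip(query_labels, ranked_labels_per_query)
    ]
    return sum(aps) / len(aps) if aps else 0.0
```

For example, a query whose ranked results are relevant at positions 1 and 3 only has an average precision of (1/1 + 2/3) / 2.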

Abstract

Cross-Modal Retrieval (CMR), which retrieves relevant items from one modality (e.g., audio) given a query in another modality (e.g., visual), has advanced significantly in recent years. This capability is crucial for robots to integrate and interpret information across diverse sensory inputs. However, the retrieval space in existing robotic CMR approaches often consists of only one modality, which limits the robot's performance. In this paper, we propose VAT-CMR, a novel CMR model that incorporates three modalities (visual, audio, and tactile) for enhanced multi-modal object retrieval. In this model, multi-modal representations are first fused to provide a holistic view of object features. To mitigate the semantic gaps between representations of different modalities, a dominant modality is then selected during the classification training phase to make the representations more distinctive and thereby improve retrieval performance. To evaluate the proposed approach, we conducted a case study; the results demonstrate that VAT-CMR surpasses competing approaches and that the proposed dominant modality selection significantly enhances cross-retrieval accuracy.

Paper Structure (16 sections, 4 equations, 6 figures, 3 tables)

Introduction
Related Works
Methodology
Problem Formulation
The proposed VAT-CMR model
Disjoint neural networks architecture
Attention mechanism
Cross-entropy loss for dominant modality optimisation
Cross-modal correlation learning
Training Details
Experimental Setup
Experiment Results
Comparison with single modality-based retrieval methods
Ablation study
Feature visualisation and learning curve
...and 1 more sections

Figures (6)

Figure 1: Illustration of visual-audio-tactile cross-modal retrieval. Left: The visual, audio and tactile representations of the same object converge within a shared space. Right: the robot retrieves the corresponding audio and tactile data when provided with a visual image of the dustpan.
Figure 2: Overview of our VAT-CMR model, shown with audio as the query modality as an example. From left to right: First, VAT-CMR takes a visual image, a tactile image and an audio sample as input. These inputs are processed through three separate neural network branches. Multi-head attention modules fuse the feature representations from the two retrieval modalities. Using the fused feature representation $F$ and the positive input modality feature $P$, the model is trained with a cross-entropy loss in which audio serves as the dominant modality and is directly linked via a solid line. The dashed lines connected to the cross-entropy module represent the cases in which another modality is selected as the dominant modality. Finally, a triplet loss function maps the features extracted from the last hidden layer of each branch to a cross-sensory embedding space. $N$ on the right side represents a negative sample used in triplet loss training.
Figure 3: Objects used in our experiments. In total, there are 20 objects in our experiments, taken from the ObjectFolder dataset [gao2022objectfolder2], each with unique geometric characteristics and made from materials with distinct properties.
Figure 4: Visualisation of two object examples from our generated synthetic dataset. From left to right: The columns represent the RGB visual images, audio spectrograms, and tactile images, respectively.
Figure 5: 2D plots illustrating latent representations for a selected set of five classes, with audio test samples as queries and Visual+Tactile as the retrieval space. Left: Features after the cross-entropy training stage; Right: The same set of features after triplet loss training.
...and 1 more figures
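
The triplet loss step described in the Figure 2 caption can be sketched as follows. This is a minimal illustration of a standard margin-based triplet loss with Euclidean distance, not the paper's implementation: treating the fused feature $F$ as the anchor, $P$ as the positive and $N$ as the negative is an assumption, as are the margin value and the function names `euclidean` and `triplet_loss`.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss for one (anchor, positive, negative) triple.

    Assumption: the anchor plays the role of the fused feature F, the positive
    is the feature P of the same object, and the negative is a feature N of a
    different object. The margin of 1.0 is a placeholder, not a reported value.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    # Loss is zero once the negative is farther than the positive by the margin.
    return max(0.0, d_pos - d_neg + margin)
```

Minimising this quantity pulls features of the same object together across modalities and pushes features of different objects apart, which matches the caption's description of mapping branch features into a shared cross-sensory embedding space.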
