A Case Study on Visual-Audio-Tactile Cross-Modal Retrieval
Jagoda Wojcik, Jiaqi Jiang, Jiacheng Wu, Shan Luo
TL;DR
VAT-CMR addresses tri-modal cross-modal retrieval across vision, audio, and tactile sensing for robotics. It introduces a three-branch network with multi-head attention fusion, trained first with a dominant-modality cross-entropy objective and then with a triplet loss that maps features into a common semantic space. The method is evaluated on a synthetic ObjectFolder-based dataset of 20 objects, where it achieves higher mean average precision (MAP) than the CCA and ObjectFolder baselines, with notable gains for vision, touch, and audio queries. Ablation studies confirm the contributions of the attention module and of dominant-modality training, and visualizations illustrate the progressive organization of the latent representations. Future work includes real-world generalization, extension to additional modalities, and zero-shot retrieval.
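As a rough illustration of the triplet objective mentioned above, the following minimal sketch computes a standard hinge-style triplet loss on toy embeddings. The Euclidean distance, the margin of 1.0, the function names, and the 2-D example vectors are all assumptions made for illustration; this summary does not specify the distance metric, margin, or sampling strategy used by VAT-CMR.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The margin value is an assumption, not taken from the paper.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical embeddings in a shared 2-D space (illustrative only).
visual_mug = [0.0, 0.0]    # anchor: visual embedding of a mug
tactile_mug = [0.0, 1.0]   # same object, tactile modality
tactile_bowl = [3.0, 4.0]  # different object, tactile modality

# Well-separated triplet: d_pos = 1, d_neg = 5, so the loss is zero.
loss_ok = triplet_loss(visual_mug, tactile_mug, tactile_bowl)

# Swapping positive and negative violates the margin: 5 - 1 + 1 = 5.
loss_bad = triplet_loss(visual_mug, tactile_bowl, tactile_mug)
```

The loss is zero once the matching cross-modal pair is closer than the non-matching pair by at least the margin, which is what pulls same-object features from different modalities together in the common space.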
Abstract
Cross-Modal Retrieval (CMR), which retrieves relevant items from one modality (e.g., audio) given a query in another modality (e.g., visual), has advanced significantly in recent years. This capability is crucial for robots to integrate and interpret information across diverse sensory inputs. However, the retrieval space in existing robotic CMR approaches often consists of only one modality, which limits the robot's performance. In this paper, we propose VAT-CMR, a novel CMR model that incorporates three modalities (visual, audio, and tactile) for enhanced multi-modal object retrieval. In this model, multi-modal representations are first fused to provide a holistic view of object features. To mitigate the semantic gaps between representations of different modalities, a dominant modality is then selected during the classification training phase, which makes the representations more distinctive and thereby improves retrieval performance. To evaluate the proposed approach, we conducted a case study, and the results demonstrate that VAT-CMR surpasses competing approaches. Furthermore, the proposed dominant modality selection significantly enhances cross-modal retrieval accuracy.
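To make the retrieval setting concrete, the sketch below ranks a gallery from one modality by distance to a query from another modality and scores the ranking with mean average precision (MAP). It is a minimal illustration under stated assumptions: the embeddings, labels, helper names, and the use of Euclidean distance are hypothetical and are not taken from the VAT-CMR implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_precision(query_vec, query_label, gallery):
    """Average precision for one query.

    gallery is a list of (vector, label) pairs from the target modality.
    Items are ranked by increasing distance to the query; an item counts
    as relevant when its label matches the query label.
    """
    ranked = sorted(gallery, key=lambda item: euclidean(query_vec, item[0]))
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries, gallery):
    """Mean of per-query average precision over all queries."""
    return sum(average_precision(v, l, gallery) for v, l in queries) / len(queries)


# Hypothetical embeddings already mapped into a shared 2-D space.
audio_gallery = [
    ([0.1, 0.0], "mug"),
    ([1.0, 1.0], "bowl"),
    ([0.2, 0.1], "mug"),
    ([0.9, 1.2], "bowl"),
]
visual_queries = [([0.0, 0.0], "mug"), ([1.0, 1.1], "bowl")]

# Both queries rank their matching audio items first, so MAP is 1.0 here.
map_score = mean_average_precision(visual_queries, audio_gallery)
```

In this toy case every relevant item is ranked ahead of every irrelevant one; a query whose first retrieved item is wrong and whose second is right would instead score an average precision of 0.5.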
