Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation

Manh Luong, Khai Nguyen, Nhat Ho, Reza Haf, Dinh Phung, Lizhen Qu

TL;DR

This work reframes cross-modal audio-text retrieval as learning the ground metric for entropic optimal transport and introduces mini-batch Learning-to-Match (m-LTM) that scales to large datasets. It augments the OT framework with a Mahalanobis-enhanced ground metric and a Partial OT variant to handle noisy correspondences, jointly training encoders for audio and text. The approach yields state-of-the-art results on AudioCaps and Clotho, closes the modality gap, and demonstrates strong zero-shot transfer on ESC-50, with robustness to label noise. The contributions offer a scalable, robust, and transferable cross-modal retrieval paradigm with practical utility for audio-text tasks.

Abstract

The Learning-to-match (LTM) framework is an effective inverse optimal transport approach for learning the underlying ground metric between two sources of data, facilitating subsequent matching. However, the conventional LTM framework faces scalability challenges, because it requires the entire dataset each time the parameters of the ground metric are updated. To adapt LTM to the deep learning context, we introduce the mini-batch Learning-to-match (m-LTM) framework for audio-text retrieval problems. This framework leverages mini-batch subsampling and a Mahalanobis-enhanced family of ground metrics. Moreover, to cope with misaligned training data in practice, we propose a variant that uses partial optimal transport to mitigate the harm of misaligned data pairs. We conduct extensive experiments on audio-text matching problems using three datasets: AudioCaps, Clotho, and ESC-50. Results demonstrate that our proposed method learns a rich and expressive joint embedding space and achieves SOTA performance. Beyond this, the proposed m-LTM framework is able to close the modality gap between audio and text embeddings and surpasses both triplet and contrastive loss in the zero-shot sound event detection task on the ESC-50 dataset. Notably, our strategy of employing partial optimal transport with m-LTM demonstrates greater noise tolerance than contrastive loss, especially under varying noise ratios in training data on the AudioCaps dataset. Our code is available at https://github.com/v-manhlt3/m-LTM-Audio-Text-Retrieval
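
The mini-batch idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation (see the repository linked above for that). It makes several assumptions: the encoders are replaced by precomputed embeddings, the ground cost is a squared Mahalanobis distance with a matrix `M`, the entropic plan is computed with plain Sinkhorn iterations under uniform marginals, and the loss is the KL divergence between the empirical one-to-one matching of the batch and that plan. The function names `mahalanobis_cost`, `sinkhorn_plan`, and `ltm_loss` are hypothetical.

```python
import numpy as np

def mahalanobis_cost(audio, text, M):
    """Pairwise cost (a_i - t_j)^T M (a_i - t_j) for a mini-batch.

    audio: (n, d) embeddings, text: (m, d) embeddings, M: (d, d) PSD matrix.
    The squared form is an assumption made here for simplicity.
    """
    diff = audio[:, None, :] - text[None, :, :]            # (n, m, d)
    return np.einsum("nmd,de,nme->nm", diff, M, diff)      # (n, m)

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)                                # audio marginal
    b = np.full(m, 1.0 / m)                                # text marginal
    K = np.exp(-cost / eps)                                # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)                                  # match column sums
        u = a / (K @ v)                                    # match row sums
    return u[:, None] * K * v[None, :]

def ltm_loss(plan, tiny=1e-300):
    """KL(empirical matching || plan), assuming pair i of the batch is (a_i, t_i)."""
    n = plan.shape[0]
    target = 1.0 / n                                       # mass on each true pair
    diag = np.maximum(np.diag(plan), tiny)                 # avoid log(0)
    return float(np.sum(target * np.log(target / diag)))

# Toy mini-batch: text embeddings are slightly perturbed audio embeddings.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 16))
audio /= np.linalg.norm(audio, axis=1, keepdims=True)
text = audio + 0.01 * rng.normal(size=(4, 16))
M = np.eye(16)                                             # identity = squared Euclidean

plan = sinkhorn_plan(mahalanobis_cost(audio, text, M))
print("loss on aligned batch:", ltm_loss(plan))
```

In a deep learning setting the same loss would be computed per mini-batch and differentiated with respect to the encoder parameters and `M`, which is what removes the need to touch the whole dataset on every update. The partial optimal transport variant, which transports only a fraction of the mass so that misaligned pairs can be left unmatched, is not covered by this sketch.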

Paper Structure (19 sections, 14 equations, 5 figures, 9 tables, 2 algorithms)

Figures (5)

  • Figure 1: t-SNE visualization of the shared audio-text embedding space on the ESC-50 test set. Each text embedding represents a label in the test set, and each audio embedding represents an audio clip in the test set.
  • Figure 2: Ablation study on the AudioCaps dataset of the transported mass for m-LTM with partial optimal transport (POT).
  • Figure 3: Visualization of the true matching and the matchings inferred by pretrained models trained with contrastive loss and with m-LTM loss, for ten audio clips and their ten corresponding captions from the AudioCaps test set.
  • Figure 4: Qualitative results for the text-to-audio retrieval task. The top-1, top-2, and top-3 retrieved audio clips are shown from left to right. The ground-truth audio for the caption is marked with a red border.
  • Figure 5: Qualitative results for the text-to-audio retrieval task. The top-1, top-2, and top-3 retrieved audio clips are shown from left to right. The ground-truth audio for the caption is marked with a red border.

Theorems & Definitions (2)

  • Definition 1
  • Definition 2