Bridging Language Gaps in Audio-Text Retrieval

Zhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang

TL;DR

This work tackles multilingual audio-text retrieval by addressing English-centric bias in two ways: language enhancement (LE) with a multilingual text encoder (SONAR), and a stronger audio encoder based on consistent ensemble distillation (CED) that handles variable-length audio. It builds a CLAP-based bi-encoder framework augmented with mixture LE, translating English captions into seven additional languages for multilingual training, and trains the model with cross-modal contrastive learning using a similarity $s$ and the InfoNCE loss $\mathcal{L}$. The authors report state-of-the-art results on the English benchmarks AudioCaps and Clotho and promising multilingual retrieval across seven languages with limited extra data, aided by translation-based data augmentation and mixture LE. The approach broadens practical applicability to real-world multilingual audio search, comes with a publicly available implementation, and highlights the value of multilingual data and advanced text encoders in cross-modal retrieval.
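To make the contrastive objective mentioned above concrete, here is a minimal sketch of a symmetric InfoNCE loss over a batch of paired audio and text embeddings. It is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, the cosine similarity scaled by a temperature stands in for $s$, the temperature value of 0.07 is a placeholder, and averaging the audio-to-text and text-to-audio directions is a common CLAP-style choice that this summary does not confirm.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(audio_embs, text_embs, temperature=0.07):
    """Return s[i][j]: cosine similarity of audio i and text j, divided by the temperature.

    The temperature value is an assumed placeholder, not taken from the paper.
    """
    audio = [l2_normalize(v) for v in audio_embs]
    text = [l2_normalize(v) for v in text_embs]
    return [
        [sum(a * t for a, t in zip(a_vec, t_vec)) / temperature for t_vec in text]
        for a_vec in audio
    ]


def diagonal_cross_entropy(logits):
    """Mean of -log softmax(row_i)[i]: row i's matching pair sits on the diagonal."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce_loss(audio_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: average of audio-to-text and text-to-audio losses (assumed form)."""
    s = similarity_matrix(audio_embs, text_embs, temperature)
    s_transposed = [list(col) for col in zip(*s)]
    return 0.5 * (diagonal_cross_entropy(s) + diagonal_cross_entropy(s_transposed))


# Toy batch: two audio clips and their two captions as 2-D embeddings.
matched = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired embeddings give a loss near zero, while swapping the captions gives a much larger loss. That gap is the signal that pulls matching audio and text embeddings together in the shared space during training.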

Abstract

Audio-text retrieval is a challenging task, requiring the search for an audio clip or a text caption within a database. The predominant focus of existing research on English descriptions limits the applicability of such models, given the abundance of non-English content in real-world data. To address these linguistic disparities, we propose language enhancement (LE), using a multilingual text encoder (SONAR) to encode the text data with language-specific information. Additionally, we optimize the audio encoder through consistent ensemble distillation (CED), enhancing support for variable-length audio-text retrieval. Our methodology excels in English audio-text retrieval, demonstrating state-of-the-art (SOTA) performance on commonly used datasets such as AudioCaps and Clotho. At the same time, the approach retrieves content in seven other languages with only 10% additional language-enhanced training data, yielding promising results. The source code is publicly available at https://github.com/zyyan4/ml-clap.

Paper Structure (14 sections, 3 equations, 2 figures, 4 tables)


Figures (2)

  • Figure 1: The proposed multilingual audio-text retrieval framework. We first generate multilingual text descriptions of the training data using the SONAR text decoder, shown on the left. We then train a multilingual audio-text retrieval model based on CLAP, shown on the right. Models are evaluated on test captions translated with ChatGPT.
  • Figure 2: Multilingual evaluation results on AudioCaps, where the x-axis represents the tested target language, with translations obtained by ChatGPT. The baseline model is trained on the original English captions, whereas "proposed" uses mixture LE. These observations are consistent with those on Clotho.