Bridging Language Gaps in Audio-Text Retrieval
Zhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang
TL;DR
This work tackles multilingual audio-text retrieval. It addresses English-centric bias through language enhancement (LE) with a multilingual text encoder (SONAR), and it strengthens the audio encoder with consistent ensemble distillation (CED) to handle variable-length audio. The framework is a CLAP-based bi-encoder augmented with mixture LE: English captions are translated into seven additional languages for multilingual training, and the model is trained with cross-modal contrastive learning using the similarity $s$ and the InfoNCE loss $\mathcal{L}$. The authors report state-of-the-art results on the English benchmarks AudioCaps and Clotho and promising retrieval in seven other languages with only 10% additional language-enhanced training data, aided by translation-based data augmentation and mixture LE. The approach broadens practical applicability to real-world multilingual audio search, comes with a publicly available implementation, and highlights the value of multilingual data and advanced text encoders in cross-modal retrieval.
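To make the contrastive objective mentioned above concrete, the following is a minimal sketch of a symmetric InfoNCE loss over a batch of paired audio and text embeddings. It is an illustration under stated assumptions, not the authors' implementation: the cosine form of the similarity $s$, the temperature value `0.07`, the equal averaging of the audio-to-text and text-to-audio directions, and all function names are hypothetical choices made for this example.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors (assumed form of s)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def log_softmax_at(row, index):
    """Numerically stable log-softmax of `row`, evaluated at `index`."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return row[index] - log_z


def info_nce(audio_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE; pair i of audio and text is the positive.

    The temperature and the 0.5/0.5 weighting are assumptions.
    """
    n = len(audio_embs)
    # s[i][j]: scaled similarity between audio i and text j.
    s = [[cosine(a, t) / temperature for t in text_embs] for a in audio_embs]
    # Audio-to-text: each row should put its mass on the diagonal entry.
    a2t = -sum(log_softmax_at(s[i], i) for i in range(n)) / n
    # Text-to-audio: same, but over the columns of s.
    t2a = -sum(
        log_softmax_at([s[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (a2t + t2a)


# Toy data: text embeddings are noisy copies of the audio embeddings.
random.seed(0)
audio = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
text_matched = [[x + 0.01 * random.gauss(0, 1) for x in row] for row in audio]
text_shuffled = text_matched[::-1]  # every pair is now mismatched

loss_matched = info_nce(audio, text_matched)
loss_shuffled = info_nce(audio, text_shuffled)
```

In this toy setup the correctly paired batch gives a lower loss than the shuffled one, which is the behaviour the objective rewards. In a multilingual setting, the text embeddings would come from captions in any of the training languages.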
Abstract
Audio-text retrieval is a challenging task that requires searching a database for an audio clip or a text caption. Existing research focuses predominantly on English descriptions, which limits the applicability of such models given the abundance of non-English content in real-world data. To address these linguistic disparities, we propose language enhancement (LE), which uses a multilingual text encoder (SONAR) to encode the text data with language-specific information. Additionally, we optimize the audio encoder by applying consistent ensemble distillation (CED), which improves support for variable-length audio-text retrieval. Our method excels in English audio-text retrieval, achieving state-of-the-art (SOTA) performance on the commonly used AudioCaps and Clotho datasets. At the same time, it retrieves content in seven other languages with promising results, using only 10% additional language-enhanced training data. The source code is publicly available at https://github.com/zyyan4/ml-clap.
