Scaling Audio-Text Retrieval with Multimodal Large Language Models

Jilan Xu, Carl Thomé, Danijela Horak, Weidi Xie, Andrew Zisserman

TL;DR

AuroLA advances audio–text retrieval by re-purposing Multimodal Large Language Models (MLLMs) as a unified backbone, trained on the diverse AudioVerse dataset with multi-granular captions. It introduces a Hybrid-NCE loss that aligns audio and text across multiple granularities, and a bidirectional re-ranking module that refines the top candidates through deep cross-modal interaction. The approach achieves state-of-the-art results across six benchmarks while using only about 1% of the training data of the recent PE-AV, and it shows clear scaling benefits with larger datasets and model capacity. Together, these components establish a scalable, data-efficient paradigm for cross-modal retrieval that leverages the reasoning and generation capabilities of MLLMs.

Abstract

Audio-text retrieval is crucial for bridging acoustic signals and natural language. While contrastive dual-encoder architectures like CLAP have shown promise, they are fundamentally limited by the capacity of small-scale encoders. In particular, their text encoders struggle to understand complex queries that require reasoning or world knowledge. In this paper, we propose AuroLA, a novel contrastive language-audio pre-training framework that re-purposes Multimodal Large Language Models (MLLMs) as a unified backbone for retrieval. We make three contributions: (i) we construct a scalable data pipeline that curates diverse audio from multiple sources and generates multi-granular captions, ranging from long descriptions to structured tags, via automated annotation; (ii) we adapt an MLLM for retrieval by prompting it to summarize the audio/text input and using the hidden state of a special token as the audio/text embedding, and we train it with a novel Hybrid-NCE loss, which employs multi-granular supervision and hard-negative reweighting to robustly align audio with diverse textual supervision; and (iii) we design an MLLM-based bidirectional re-ranking module that refines retrieval candidates through deep cross-modal interaction. Extensive experiments demonstrate that AuroLA consistently outperforms state-of-the-art models, including the recent PE-AV, while utilizing only approximately 1% of PE-AV's training data. Lastly, we observe clear scaling trends regarding dataset size and model capacity, validating the effectiveness of MLLMs as a unified backbone for audio-text retrieval. Code is available at https://github.com/Jazzcharles/AuroLA.
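
The contrastive objective described in the abstract can be illustrated with a minimal NumPy sketch. The exact form of Hybrid-NCE is not given in this summary, so the code below is only a standard symmetric InfoNCE loss with an optional hard-negative reweighting term. The weighting scheme (negatives up-weighted in proportion to exp(beta * similarity)), the function names `reweighted_nce` and `symmetric_contrastive_loss`, the parameter `beta`, and the temperature value are all illustrative assumptions, not the authors' formulation. Multi-granular supervision is not modeled here.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def reweighted_nce(logits, beta=0.0):
    """One-directional contrastive loss over an (N, N) logit matrix.

    The diagonal holds the positive pairs. Requires N >= 2.
    ASSUMPTION: negatives are reweighted by a softmax over beta * logits,
    rescaled so the negative weights in each row sum to N - 1.
    With beta = 0 every weight is 1 and this is plain InfoNCE.
    """
    n = logits.shape[0]
    eye = np.eye(n, dtype=bool)

    # Softmax over the negatives only (the diagonal is masked out).
    neg = np.where(eye, -np.inf, beta * logits)
    neg = neg - neg.max(axis=1, keepdims=True)
    w = np.exp(neg)
    w = w / w.sum(axis=1, keepdims=True) * (n - 1)
    w[eye] = 1.0  # the positive always keeps weight 1

    # Numerically stable log of the weighted partition sum.
    m = logits.max(axis=1, keepdims=True)
    log_denom = np.log((w * np.exp(logits - m)).sum(axis=1)) + m[:, 0]
    return float(np.mean(log_denom - np.diag(logits)))


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07, beta=0.0):
    """Average of the audio-to-text and text-to-audio losses."""
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature
    return 0.5 * (reweighted_nce(logits, beta) + reweighted_nce(logits.T, beta))
```

With `beta = 0` the function reduces to plain InfoNCE; a larger `beta` increases the penalty on negatives that score close to the positive.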

Paper Structure (23 sections, 13 equations, 8 figures, 11 tables)

Figures (8)

  • Figure 1: Data processing pipeline. We assemble audio from diverse platforms and datasets. Qwen3-Omni-30B-A3B is used to generate multi-granular captions based on raw audio, task instructions, few-shot examples, and auxiliary textual clues.
  • Figure 2: Overall architecture of our unified MLLM-based retrieval model (left) and re-ranking model (right). The retrieval model is trained by aligning the embedding tokens of audio and text inputs via a novel Hybrid-NCE loss. The re-ranking model is trained to judge pairwise audio-text matching with cross-modal interactions, effectively refining initial retrieval results.
  • Figure 3: Comparison between different losses. InfoNCE only pulls paired audio and captions closer, while pushing the remaining pairs away. In contrast, Hybrid-NCE additionally pulls potential positive tag captions closer and pushes hard-negative samples further via adaptive reweighting.
  • Figure 4: Scaling trends for pre-training data (1% to 100%) and model size (3B vs 7B).
  • Figure 5: Distributions of audio and text embeddings. The lines connect paired audio and text. Maximum Mean Discrepancy measures the alignment between the two modalities (lower means better aligned).
  • ...and 3 more figures
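
The Maximum Mean Discrepancy mentioned in the Figure 5 caption can be sketched as follows. This is the standard biased estimator of squared MMD. The RBF kernel, the bandwidth `sigma`, and the function names are assumptions, since the kernel used in the paper is not stated here.

```python
import numpy as np


def rbf_kernel(x, y, sigma):
    """Gaussian (RBF) kernel matrix between the rows of x and the rows of y."""
    sq_dist = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))


def mmd2_biased(x, y, sigma=1.0):
    """Biased estimate of squared MMD between two sets of embeddings.

    ASSUMPTION: RBF kernel with a fixed bandwidth sigma.
    Lower values mean the two embedding distributions are closer.
    """
    k_xx = rbf_kernel(x, x, sigma).mean()
    k_yy = rbf_kernel(y, y, sigma).mean()
    k_xy = rbf_kernel(x, y, sigma).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)
```

Here `x` and `y` would be the audio and text embedding matrices, one row per sample. The value is close to zero when the two sets are drawn from the same distribution and grows as the modalities drift apart.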