Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval

Paul Primus, Florian Schmid, Gerhard Widmer

TL;DR

This work addresses language-based audio retrieval, where explicit audio-caption correspondences are scarce: datasets label only the matching pairs, and every other pairing is assumed to be a mismatch. It introduces a two-stage training framework. Stage 1 trains standard dual-encoder models; Stage 2 uses non-binary audio-caption correspondences, estimated by an ensemble of Stage-1 models, as refined training targets. The approach improves retrieval performance on ClothoV2 and AudioCaps and achieves state-of-the-art performance on ClothoV2 when scaled with WavCaps data. Because the targets are derived from the models' own predictions, no additional correspondence annotation is required.

Abstract

Dual-encoder-based audio retrieval systems are commonly optimized with contrastive learning on a set of matching and mismatching audio-caption pairs. This leads to a shared embedding space in which corresponding items from the two modalities end up close together. Since audio-caption datasets typically only contain matching pairs of recordings and descriptions, it has become common practice to create mismatching pairs by pairing the audio with a caption randomly drawn from the dataset. This is not ideal because the randomly sampled caption could, just by chance, partly or entirely describe the audio recording. However, correspondence information for all possible pairs is costly to annotate and thus typically unavailable; we therefore suggest substituting it with estimated correspondences. To this end, we propose a two-stage training procedure in which multiple retrieval models are first trained as usual, i.e., without estimated correspondences. In the second stage, the audio-caption correspondences predicted by these models serve as prediction targets. We evaluate our method on the ClothoV2 and AudioCaps benchmarks and show that it improves retrieval performance, even in a restricted self-distillation setting where a single model generates and then learns from the estimated correspondences. We further show that our method outperforms the current state of the art by 1.6 pp. mAP@10 on the ClothoV2 benchmark.
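
The standard contrastive setup described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the symmetric cross-entropy form, the cosine similarity, the temperature `tau`, and all function names are assumptions made for the example. Every in-batch caption with a different index is treated as a mismatch, which is the simplification the paper questions.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def contrastive_loss(audio_emb, caption_emb, tau=0.05):
    # Similarity of every audio i with every caption j; tau is an assumed temperature.
    sim = l2_normalize(audio_emb) @ l2_normalize(caption_emb).T / tau
    idx = np.arange(sim.shape[0])
    # Binary targets: only the pair (i, i) counts as a match.
    loss_audio_to_caption = -log_softmax(sim, axis=1)[idx, idx].mean()
    loss_caption_to_audio = -log_softmax(sim, axis=0)[idx, idx].mean()
    return float(0.5 * (loss_audio_to_caption + loss_caption_to_audio))


rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))      # stand-in for audio embeddings
captions_aligned = audio.copy()      # stand-in captions that match their audio perfectly
captions_shuffled = audio[::-1]      # same captions, paired with the wrong audio

loss_aligned = contrastive_loss(audio, captions_aligned)
loss_shuffled = contrastive_loss(audio, captions_shuffled)
```

With aligned embeddings the loss is close to zero; pairing each recording with the wrong caption raises it sharply, even though the shuffled batch still contains a perfectly matching caption for every recording.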

Paper Structure (15 sections, 9 equations, 1 figure, 2 tables)

Figures (1)

  • Figure 1: Audio and descriptions are transformed into the shared audio-caption embedding space via the audio and description embedding models $\phi_\mathrm{a}$ and $\phi_\mathrm{c}$, respectively. In stage 1, we assume that audio $a_i$ and caption $c_j$ do not match if $i \neq j$ and train the model with the contrastive loss $\mathcal{L}_{\mathrm{sup}}$. Stage 2 uses predictions ensembled from several stage 1 models (bottom left) to estimate the correspondence between $a_i$ and $c_j$; those estimates then serve as prediction targets instead of the ground truth from stage 1. Stage 2 model parameters are initialized with stage 1 parameters, and the corresponding loss is denoted as $\mathcal{L}_{\mathrm{dist}}$.
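
Stage 2 as described in the caption can be sketched in NumPy. The sketch assumes that the ensemble averages the similarity matrices of the stage 1 models and that a softmax over captions turns the average into soft targets; the caption specifies neither detail. The names `ensemble_targets` and `distillation_loss` are hypothetical, and only the audio-to-caption direction is shown.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def ensemble_targets(stage1_sims):
    # stage1_sims has shape (models, audios, captions).
    # Assumption: ensemble by averaging similarities, then softmax over captions.
    mean_sim = np.mean(stage1_sims, axis=0)
    return np.exp(log_softmax(mean_sim, axis=1))


def distillation_loss(student_sim, targets):
    # Cross-entropy between the estimated correspondences (soft targets)
    # and the stage 2 model's predicted distribution over captions.
    log_pred = log_softmax(student_sim, axis=1)
    return float(-(targets * log_pred).sum(axis=1).mean())


rng = np.random.default_rng(0)
stage1_sims = rng.normal(size=(3, 4, 4))   # 3 stand-in stage 1 models, 4 audios, 4 captions
targets = ensemble_targets(stage1_sims)    # non-binary: every row is a distribution

# A student that reproduces the ensemble reaches the minimum of this loss;
# a student that scores all captions equally does not.
loss_matching = distillation_loss(stage1_sims.mean(axis=0), targets)
loss_uniform = distillation_loss(np.zeros((4, 4)), targets)
```

Unlike the binary targets of stage 1, each row of `targets` spreads probability mass over all captions in the batch, so a caption that partly describes a recording is no longer treated as a pure mismatch.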