Fusing Audio and Metadata Embeddings Improves Language-based Audio Retrieval

Paul Primus; Gerhard Widmer

TL;DR

This work tackles language-based audio retrieval with a hybrid system that uses audio metadata alongside the audio signal to better connect textual queries with audio items. It uses two independent encoders, one for audio ($\phi_a$) and one for metadata/text ($\phi_t$), and compares late fusion and mid-level fusion strategies for forming a joint item representation that is projected into a shared retrieval space. Empirical results on ClothoV2 and AudioCaps show that metadata can meaningfully boost performance (e.g., up to +8.82 mAP@10 with Full-Sentences on ClothoV2 and +3.69 on AudioCaps); late fusion often ranks items more effectively, while mid-level fusion enables crossmodal interactions. The study also finds that combining open- and closed-set tags helps under certain data conditions, that artificially generated captions from metadata can hurt hybrid models, and that sharing the text encoder across query and metadata improves performance. These findings highlight practical considerations for deploying metadata-enhanced retrieval systems.

Abstract

Matching raw audio signals with textual descriptions requires understanding the audio's content and the description's semantics and then drawing connections between the two modalities. This paper investigates a hybrid retrieval system that utilizes audio metadata as an additional clue to understand the content of audio signals before matching them with textual queries. We experimented with metadata often attached to audio recordings, such as keywords and natural-language descriptions, and we investigated late and mid-level fusion strategies to merge audio and metadata. Our hybrid approach with keyword metadata and late fusion improved the retrieval performance over a content-based baseline by 2.36 and 3.69 pp. mAP@10 on the ClothoV2 and AudioCaps benchmarks, respectively.
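To make the late-fusion pipeline and the mAP@10 metric concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that late fusion concatenates the audio and metadata embeddings and applies a single linear projection into the retrieval space; the random embeddings, the dimensions, the weights `W` and `b`, and the function names are hypothetical stand-ins for the outputs of $\phi_a$ and $\phi_t$ and for trained parameters. The mAP@10 function uses one common convention (normalizing by the number of relevant items, capped at k); the paper's exact definition may differ.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def late_fusion(audio_emb, meta_emb, W, b):
    """Assumed fusion operator: concatenate, project linearly, normalize."""
    joint = np.concatenate([audio_emb, meta_emb], axis=-1)
    return l2_normalize(joint @ W + b)

def rank_items(query_emb, item_emb):
    """Return item indices per query, sorted by descending cosine similarity."""
    sims = l2_normalize(query_emb) @ l2_normalize(item_emb).T
    return np.argsort(-sims, axis=-1)

def map_at_k(rankings, relevant, k=10):
    """Mean average precision at k; `relevant` holds one set of item ids per query."""
    aps = []
    for ranked, rel in zip(rankings, relevant):
        hits, precisions = 0, []
        for pos, item in enumerate(ranked[:k], start=1):
            if int(item) in rel:
                hits += 1
                precisions.append(hits / pos)
        denom = min(len(rel), k)
        aps.append(sum(precisions) / denom if denom else 0.0)
    return float(np.mean(aps))

# Hypothetical stand-ins for encoder outputs and trained projection weights.
rng = np.random.default_rng(0)
n_items, d_audio, d_meta, d_joint = 5, 8, 6, 4
audio = rng.normal(size=(n_items, d_audio))   # would come from phi_a
meta = rng.normal(size=(n_items, d_meta))     # would come from phi_t
W = rng.normal(size=(d_audio + d_meta, d_joint))
b = np.zeros(d_joint)

items = late_fusion(audio, meta, W, b)
queries = items[[2, 0]]                       # toy queries that coincide with items 2 and 0
rankings = rank_items(queries, items)
print(map_at_k(rankings, [{2}, {0}], k=10))
```

In a trained system, the query embedding would come from the text encoder rather than being copied from an item, and the projection would be learned jointly with the encoders.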

Paper Structure (18 sections, 4 figures, 2 tables)

Introduction
Related Work
Methodology
Metadata
Audio, Metadata & Query Embedding
Audio-Metadata Fusion
Experimental Setup
Datasets & Benchmarks
Pretrained Models
Optimization
Evaluation Metrics
Results & Discussion
Does the use of metadata lead to improved retrieval performance compared to a pure content-based approach?
Does hybrid retrieval benefit from modeling crossmodal interactions between audio and metadata embeddings?
How does combining open- and closed-set tags impact the performance?
...and 3 more sections

Figures (4)

Figure 1: Left: Comparison of pure metadata- and content-based methods (orange and blue, respectively) on the ClothoV2 benchmark. Right: Illustration of the multimodal retrieval space of our hybrid approach. Audio signal (blue) and metadata (orange) are embedded and fused to represent an item $(a,m)$. The similarity to an embedded query $q$ (green) is measured via distance.
Figure 2: Late Fusion of audio (blue) and metadata (orange). The fused representation is matched with the embedded query (green) via cosine similarity.
Figure 3: Mid-Level fusion: The matching of fused audio and metadata embeddings is inspired by the multimodal transformer MMT.
Figure 4: Relative frequency of the 15 most common keywords in ClothoV2 (blue) and their corresponding frequencies in the audio captions (orange).
