Fusing Audio and Metadata Embeddings Improves Language-based Audio Retrieval

Paul Primus, Gerhard Widmer

TL;DR

This work tackles language-based audio retrieval with a hybrid system that uses audio metadata alongside the audio signal to better connect textual queries with audio items. It develops two independent encoders, one for audio ($\phi_a$) and one for metadata/text ($\phi_t$), and compares late fusion and mid-level fusion strategies for forming a joint item representation that is projected into a shared retrieval space. Empirical results on ClothoV2 and AudioCaps show that metadata can meaningfully boost performance (e.g., up to +8.82 mAP@10 with Full-Sentences on ClothoV2 and +3.69 on AudioCaps), with late fusion often ranking items more effectively and mid-level fusion enabling crossmodal interactions. The study also finds that combining open- and closed-set tags helps under certain data conditions, that captions artificially generated from metadata can hurt hybrid models, and that sharing the text encoder between queries and metadata improves performance. These findings highlight practical considerations for deploying metadata-enhanced retrieval systems.

Abstract

Matching raw audio signals with textual descriptions requires understanding the audio's content and the description's semantics and then drawing connections between the two modalities. This paper investigates a hybrid retrieval system that utilizes audio metadata as an additional clue to understand the content of audio signals before matching them with textual queries. We experimented with metadata often attached to audio recordings, such as keywords and natural-language descriptions, and we investigated late and mid-level fusion strategies to merge audio and metadata. Our hybrid approach with keyword metadata and late fusion improved the retrieval performance over a content-based baseline by 2.36 and 3.69 pp. mAP@10 on the ClothoV2 and AudioCaps benchmarks, respectively.
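
The improvements above are reported in percentage points of mAP@10. As a rough illustration of how such a score can be computed, the following sketch uses one common definition of average precision at a cutoff; the function names and the toy rankings are hypothetical and are not taken from the paper. When each query has exactly one relevant item (an assumption made here for the toy data), AP@10 reduces to the reciprocal rank of that item if it appears in the top 10, and 0 otherwise.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """AP@k for one query: mean of precision values at the ranks of relevant hits.

    Normalizes by min(number of relevant items, k), which is one common
    convention; other variants exist.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item_id in enumerate(ranked_ids[:k], start=1):
        if item_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(len(relevant), k)


def mean_average_precision_at_k(rankings, relevants, k=10):
    """mAP@k: average of AP@k over all queries."""
    scores = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores)


# Toy example (hypothetical data): three queries, one relevant item each.
rankings = [
    ["b", "a", "c"],  # relevant item "a" at rank 2 -> AP = 0.5
    ["a", "b", "c"],  # relevant item "a" at rank 1 -> AP = 1.0
    ["b", "c", "d"],  # relevant item "a" not retrieved -> AP = 0.0
]
relevants = [["a"], ["a"], ["a"]]
map10 = mean_average_precision_at_k(rankings, relevants, k=10)
```

On a 0 to 100 scale, an improvement of 2.36 pp. means the mean of these per-query scores rises by 0.0236 in absolute terms.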

Paper Structure (18 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Left: Comparison of pure metadata- and content-based methods (orange and blue, respectively) on the ClothoV2 benchmark. Right: Illustration of the multimodal retrieval space of our hybrid approach. Audio signal (blue) and metadata (orange) are embedded and fused to represent an item $(a,m)$. The similarity to an embedded query $q$ (green) is measured via distance.
  • Figure 2: Late Fusion of audio (blue) and metadata (orange). The fused representation is matched with the embedded query (green) via cosine similarity.
  • Figure 3: Mid-Level fusion: The matching of fused audio and metadata embeddings is inspired by the multimodal transformer MMT.
  • Figure 4: Relative frequency of the 15 most common keywords in ClothoV2 (blue) and their corresponding frequencies in the audio captions (orange).
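
To make the late-fusion idea from Figure 2 concrete, here is a minimal sketch of fusing an audio embedding with a metadata embedding and ranking items by cosine similarity to a query embedding. The fusion rule used here (a weighted sum of L2-normalized embeddings with a hypothetical weight `alpha`) is an assumption chosen for simplicity; it stands in for whatever learned fusion the paper actually uses. All names and the toy vectors are illustrative only.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        return list(v)
    return [x / norm for x in v]


def cosine_similarity(u, v):
    """Cosine similarity of two vectors (0.0 if either is the zero vector)."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))


def late_fuse(audio_emb, meta_emb, alpha=0.5):
    """Assumed fusion rule: weighted sum of the normalized embeddings."""
    a = l2_normalize(audio_emb)
    m = l2_normalize(meta_emb)
    return [alpha * x + (1 - alpha) * y for x, y in zip(a, m)]


def rank_items(query_emb, items, alpha=0.5):
    """Return (item_id, score) pairs sorted by similarity to the query, best first."""
    scored = [
        (item_id, cosine_similarity(query_emb, late_fuse(a, m, alpha)))
        for item_id, (a, m) in items.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy example (hypothetical 2-d embeddings): each item is (audio_emb, metadata_emb).
items = {
    "rain": ([1.0, 0.0], [2.0, 0.0]),
    "dog": ([0.0, 1.0], [0.0, 3.0]),
}
ranking = rank_items([1.0, 0.0], items)
```

In this sketch the audio and metadata embeddings are computed independently and only combined at the end, which is what distinguishes late fusion from the mid-level variant in Figure 3, where the two modalities can interact before the final representation is formed.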