Fusing Audio and Metadata Embeddings Improves Language-based Audio Retrieval
Paul Primus, Gerhard Widmer
TL;DR
This work tackles language-based audio retrieval with a hybrid system that uses audio metadata alongside the audio signal to better connect textual queries with audio items. It uses two independent encoders, one for audio ($\phi_a$) and one for metadata/text ($\phi_t$), and compares late fusion and mid-level fusion strategies for forming a joint item representation that is projected into a shared retrieval space. Empirical results on ClothoV2 and AudioCaps show that metadata can meaningfully boost performance (e.g., up to +8.82 mAP@10 with Full-Sentences on ClothoV2 and +3.69 on AudioCaps), with late fusion often ranking items more effectively and mid-level fusion enabling cross-modal interactions. The study also finds that combining open- and closed-set tags helps under certain data conditions, that artificially generated captions from metadata can hurt hybrid models, and that sharing the text encoder across query and metadata improves performance. These findings highlight practical considerations for deploying metadata-enhanced retrieval systems.
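To make the late-fusion idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each item already has one pooled embedding from $\phi_a$ and one from $\phi_t$, that fusion is a concatenation followed by a single linear projection into the shared space, and that queries are scored by cosine similarity. The function names, the projection matrix `W`, and all dimensions are hypothetical, and random vectors stand in for real encoder outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def late_fusion(audio_emb, meta_emb, W):
    """Concatenate per-item audio and metadata embeddings, then project them
    into the shared retrieval space (assumed form of late fusion)."""
    joint = np.concatenate([audio_emb, meta_emb], axis=-1)
    return l2_normalize(joint @ W)

def rank_items(query_emb, item_embs):
    """Return item indices sorted by descending cosine similarity to the query."""
    scores = item_embs @ l2_normalize(query_emb)
    return np.argsort(-scores), scores

# Stand-ins for encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
n_items, d_a, d_t, d_shared = 5, 8, 6, 4
audio_emb = rng.normal(size=(n_items, d_a))   # would come from phi_a
meta_emb = rng.normal(size=(n_items, d_t))    # would come from phi_t
W = rng.normal(size=(d_a + d_t, d_shared))    # learned in practice, random here

items = late_fusion(audio_emb, meta_emb, W)
query = rng.normal(size=d_shared)             # would be the encoded text query
order, scores = rank_items(query, items)
print(order)
```

In this sketch the two modalities interact only through the final projection. A mid-level fusion variant would instead let audio and metadata token sequences attend to each other before pooling.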
Abstract
Matching raw audio signals with textual descriptions requires understanding the audio's content and the description's semantics and then drawing connections between the two modalities. This paper investigates a hybrid retrieval system that utilizes audio metadata as an additional clue to understand the content of audio signals before matching them with textual queries. We experimented with metadata often attached to audio recordings, such as keywords and natural-language descriptions, and we investigated late and mid-level fusion strategies to merge audio and metadata. Our hybrid approach with keyword metadata and late fusion improved the retrieval performance over a content-based baseline by 2.36 and 3.69 pp. mAP@10 on the ClothoV2 and AudioCaps benchmarks, respectively.
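The reported metric, mAP@10, can be illustrated with a short sketch. It assumes a common definition: the average precision of each query is computed over its top 10 retrieved items, normalized by the smaller of 10 and the number of relevant items, and then averaged over queries. The exact normalization used in the benchmarks' evaluation code may differ, and the function names and toy rankings below are hypothetical.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """AP@k for one query; ranked_ids is assumed to contain no duplicates."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / min(len(relevant), k)

def mean_average_precision_at_k(rankings, relevants, k=10):
    """Mean of AP@k over all queries."""
    aps = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps)

# Toy example with one relevant item per query.
rankings = [[3, 1, 2], [1, 5, 2], [1, 2, 3]]
relevants = [{3}, {5}, {9}]
map10 = mean_average_precision_at_k(rankings, relevants)
print(map10)  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```

With a single relevant item per query, AP@10 reduces to the reciprocal of that item's rank, or zero if it falls outside the top 10. An improvement of 2.36 pp. therefore corresponds to a 0.0236 increase in this score on a 0 to 1 scale.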
