Leveraging User-Generated Metadata of Online Videos for Cover Song Identification

Simon Hachmeier, Robert Jäschke

TL;DR

This work tackles cover song identification (CSI) on YouTube by framing it as a multimodal task that combines user-generated video metadata with audio-based features. It introduces three entity resolution (ER) strategies (Fuzzy Matching, S-BERT, and Ditto) and fuses their outputs with audio CSI models (CQTNet, CoverHunter) using LambdaMART ranking to form ER-CSI ensembles. Experiments on YouTube-derived versions of SHS100K and DaTacos show that metadata-based ER can stabilize and occasionally improve CSI performance, with notable improvements in the mean rank of the first relevant item (MR1) and mean average precision (MAP), particularly for robust configurations such as S-BERT with Ditto. The study highlights both the practical potential of metadata in CSI and its limitations when metadata is ambiguous or absent, and points toward larger language models as a direction for improved robustness.

Abstract

YouTube is a rich source of cover songs. Since the platform itself is organized in terms of videos rather than songs, the retrieval of covers is not trivial. The field of cover song identification addresses this problem and provides approaches that usually rely on audio content. However, including the user-generated video metadata available on YouTube promises improved identification results. In this paper, we propose a multi-modal approach for cover song identification on online video platforms. We combine entity resolution models with audio-based approaches using a ranking model. Our findings indicate that leveraging user-generated metadata can stabilize cover song identification performance on YouTube.
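
The combination of metadata matching and audio similarity described above can be illustrated with a minimal sketch. It is not the authors' implementation: Python's standard `difflib` stands in for the fuzzy matching strategy, and a fixed weighted sum stands in for the learned LambdaMART ranking model. The function names, the `weight` parameter, the candidate title patterns, and the dictionary keys (`id`, `title`, `audio_score`) are all hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def metadata_score(video_title: str, song_title: str, performer: str) -> float:
    """Best fuzzy match between a video title and a few title patterns.

    The patterns are assumptions about how uploaders might name videos.
    """
    patterns = [
        f"{song_title} - {performer}",
        f"{performer} - {song_title}",
        song_title,
    ]
    return max(fuzzy_similarity(video_title, p) for p in patterns)


def rank_videos(song_title, performer, videos, weight=0.5):
    """Rank candidate videos by a weighted sum of audio and metadata scores.

    Assumption: each video dict carries a precomputed `audio_score` in [0, 1]
    from some audio-based model. The weighted sum is a simple stand-in for a
    learned ranking model.
    """
    scored = []
    for video in videos:
        meta = metadata_score(video["title"], song_title, performer)
        fused = weight * video["audio_score"] + (1.0 - weight) * meta
        scored.append((fused, video["id"]))
    scored.sort(reverse=True)  # highest fused score first
    return [video_id for _, video_id in scored]


videos = [
    {"id": "a", "title": "Yesterday - The Beatles (acoustic cover)", "audio_score": 0.60},
    {"id": "b", "title": "Funny cat compilation 2020", "audio_score": 0.65},
]
ranking = rank_videos("Yesterday", "The Beatles", videos)
```

In this toy setting, the unrelated video has a slightly higher audio score, but the metadata match moves the actual cover to the top of the ranking. Setting `weight=1.0` reduces the sketch to an audio-only ranking.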

Paper Structure

This paper contains 12 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Example input items for the work "Yesterday", written by John Lennon and Paul McCartney. The colors of the box frames and text indicate the data source: blue for Secondhandsongs and red for YouTube.