On the Robustness of Cover Version Identification Models: A Study Using Cover Versions from YouTube

Simon Hachmeier; Robert Jäschke

TL;DR

This work addresses the gap in cover version identification (VI) robustness by evaluating state-of-the-art VI models on YouTube data, which presents distinct alterations from traditional SHS-based benchmarks. It introduces SHS-YT, a YouTube-derived benchmark created through multi-modal uncertainty sampling, crowdsourced annotations, and expert curation, along with a taxonomy of alterations observed in online videos. The study reveals substantial performance gaps for existing VI models on YouTube content, especially for drum-only, instrumental, and medley versions, and highlights the need for broader training data, audio stems, and improved alignment strategies. The proposed dataset and taxonomy provide a practical framework to improve VI under real-world online-video conditions, with implications for copyright detection, music retrieval, and robustness of audio-visual analysis systems.

Abstract

Recent advances in cover song identification have shown great success. However, models are usually tested on a fixed set of datasets that rely on the online cover song database SecondHandSongs. It is unclear how well models perform on cover songs from online video platforms, which might exhibit unexpected alterations. In this paper, we annotate a subset of songs from YouTube, sampled with a multi-modal uncertainty sampling approach, and evaluate state-of-the-art models. We find that existing models achieve significantly lower ranking performance on our dataset than on a community dataset. We additionally measure performance on different types of versions (e.g., instrumental versions) and find several types that are particularly hard to rank. Lastly, we provide a taxonomy of alterations in cover versions on the web.

Paper Structure (25 sections, 2 equations, 6 figures, 7 tables)

Introduction
Related Work
Version Identification
Music on YouTube
Dataset Creation
Candidate Retrieval
Uncertainty Sampling
Modality Proxies.
Similarity and Matching Confidence Aggregation.
Disagreement Sampling.
Mutual Uncertainty.
Annotation
Relevance Classes.
Crowdsourcing.
Curation.
...and 10 more sections

Figures (6)

Figure 1: Dataset creation.
Figure 2: Our instructions and examples to workers as presented on MTurk. Please note that the examples on the right are cropped to fit.
Figure 3: Gaussian kernel density estimates for properties of the videos in the SHS-YT dataset. The bandwidth parameter is estimated by Scott's method.
Figure 4: Relative proportion of uncertainty class annotated.
Figure 5: Mean cosine similarities of CoverHunter embeddings between YT-Versions per respective uncertainty class.
...and 1 more figures
