On the Robustness of Cover Version Identification Models: A Study Using Cover Versions from YouTube

Simon Hachmeier, Robert Jäschke

TL;DR

This work examines the robustness of cover version identification (VI) models by evaluating state-of-the-art models on YouTube data, which exhibits alterations that differ from those in traditional benchmarks based on SecondHandSongs (SHS). It introduces SHS-YT, a YouTube-derived benchmark created through multi-modal uncertainty sampling, crowdsourced annotations, and expert curation, along with a taxonomy of alterations observed in online videos. The study reveals substantial performance gaps for existing VI models on YouTube content, especially for drum-only, instrumental, and medley versions, and highlights the need for broader training data, audio stems, and improved alignment strategies. The proposed dataset and taxonomy provide a practical framework for improving VI under real-world online-video conditions, with implications for copyright detection, music retrieval, and the robustness of audio-visual analysis systems.

Abstract

Recent advances in cover song identification have shown great success. However, models are usually tested on a fixed set of datasets that rely on the online cover song database SecondHandSongs. It is unclear how well models perform on cover songs from online video platforms, which might exhibit unexpected alterations. In this paper, we annotate a subset of songs from YouTube sampled by a multi-modal uncertainty sampling approach and evaluate state-of-the-art models. We find that existing models achieve significantly lower ranking performance on our dataset than on a community dataset. We additionally measure the performance on different types of versions (e.g., instrumental versions) and find several types that are particularly hard to rank. Lastly, we provide a taxonomy of alterations in cover versions on the web.

Paper Structure (25 sections, 2 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Dataset creation.
  • Figure 2: Our instructions and examples to workers as presented on MTurk. Please note that the examples on the right are cropped to fit.
  • Figure 3: Gaussian kernel density estimates for properties of the videos in the SHS-YT dataset. The bandwidth parameter is estimated by Scott's method.
  • Figure 4: Relative proportion of uncertainty class annotated.
  • Figure 5: Mean cosine similarities of CoverHunter embeddings between YT versions per uncertainty class.
  • ...and 1 more figure
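
The caption of Figure 3 refers to Gaussian kernel density estimates with a bandwidth chosen by Scott's method. The following is a minimal sketch of that idea for one-dimensional data, assuming the common form of Scott's rule, h = sigma * n^(-1/5). The function names and the sample durations are illustrative assumptions and do not come from the paper, which does not state here which library or code produced the figure.

```python
import math
import statistics


def scott_bandwidth(samples):
    """Scott's rule for 1-D data: h = sigma * n ** (-1/5)."""
    n = len(samples)
    sigma = statistics.stdev(samples)  # sample standard deviation
    return sigma * n ** (-1.0 / 5.0)


def gaussian_kde(samples, bandwidth=None):
    """Return a function that evaluates the Gaussian KDE at a point x."""
    h = scott_bandwidth(samples) if bandwidth is None else bandwidth
    # Normalization so that the estimated density integrates to 1.
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)

    return density


# Hypothetical video durations in seconds (illustrative values only).
durations = [182.0, 201.0, 215.0, 240.0, 247.0, 263.0, 305.0, 412.0]
density = gaussian_kde(durations)
```

Calling `density(x)` returns the estimated density at `x`. A smaller bandwidth would give a spikier curve, and a larger one a smoother curve. Scott's rule picks the bandwidth from the sample spread and sample size alone.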