Automatic Identification of Samples in Hip-Hop Music via Multi-Loss Training and an Artificial Dataset

Huw Cheston, Jan Van Balen, Simon Durand

TL;DR

This work tackles automatic identification of hip-hop samples by training a deep neural network on an artificial dataset created from non-commercial sources via source separation. A two-tower CNN with a joint classification and metric-learning objective learns robust embeddings that map anchor and candidate audio into a shared space, enabling retrieval and partial localization of samples. On a real-world hip-hop dataset, the approach surpasses traditional acoustic-landmark fingerprinting by about 13% in mean average precision and can locate sampled segments within ±5 seconds for roughly half of test cases. While promising, the authors discuss limitations related to dataset realism and potential shortcuts, and propose future work toward larger real-world datasets and domain-specific architectures to further improve performance and generalization.
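The retrieval comparison above is reported as mean average precision. As a rough illustration of how that metric is computed over ranked candidate lists, the sketch below may help. The function names, the track IDs in the usage note, and the choice to score queries with no relevant candidates as zero are assumptions for illustration, not the authors' evaluation code.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query.

    ranked_ids: candidate IDs ordered from most to least similar.
    relevant_ids: IDs of candidates that truly contain the sample.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # assumption: queries without ground truth score zero
    hits = 0
    precision_sum = 0.0
    for rank, candidate in enumerate(ranked_ids, start=1):
        if candidate in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, ground_truth):
    """Mean of per-query average precision.

    rankings: dict mapping query ID -> ranked list of candidate IDs.
    ground_truth: dict mapping query ID -> iterable of relevant candidate IDs.
    """
    scores = [average_precision(rankings[q], ground_truth[q]) for q in rankings]
    return sum(scores) / len(scores) if scores else 0.0
```

For instance, with hypothetical IDs, `mean_average_precision({"q1": ["a", "b", "c"], "q2": ["b", "a"]}, {"q1": ["a", "c"], "q2": ["a"]})` averages the two per-query scores, so a system that ranks true sample sources higher receives a higher value.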

Abstract

Sampling, the practice of reusing recorded music or sounds from another source in a new work, is common in popular music genres like hip-hop and rap. Numerous services have emerged that allow users to identify connections between samples and the songs that incorporate them, with the goal of enhancing music discovery. Designing a system that can perform the same task automatically is challenging, as samples are commonly altered with audio effects like pitch-shifting and time-stretching and may only be seconds long. Progress on this task has been minimal and is further blocked by the limited availability of training data. Here, we show that a convolutional neural network trained on an artificial dataset can identify real-world samples in commercial hip-hop music. We extract vocal, harmonic, and percussive elements from several databases of non-commercial music recordings using audio source separation, and train the model to fingerprint a subset of these elements in transformed versions of the original audio. We optimize the model using a joint classification and metric learning loss and show that it achieves 13% greater precision on real-world instances of sampling than a fingerprinting system using acoustic landmarks, and that it can recognize samples that have been both pitch-shifted and time-stretched. We also show that, for half of the commercial music recordings we tested, our model is capable of locating the position of a sample to within five seconds.
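To make the idea of a joint classification and metric learning loss concrete, here is a minimal sketch for a single training example. It assumes a softmax cross-entropy term plus a triplet term on L2-normalized embeddings, combined by a weighted sum. The triplet form, the `margin` and `weight` values, and all function names are illustrative assumptions; the abstract does not specify the exact formulation.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example (log-sum-exp for stability)."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on (anchor-positive distance) - (anchor-negative distance) + margin.

    The margin value is an assumption chosen for illustration.
    """
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    return max(0.0, euclidean(a, p) - euclidean(a, n) + margin)


def joint_loss(logits, target, anchor, positive, negative, weight=1.0, margin=0.2):
    """Weighted sum of the classification and metric terms (weight is assumed)."""
    return cross_entropy(logits, target) + weight * triplet_loss(
        anchor, positive, negative, margin
    )
```

In this sketch the metric term falls to zero once the positive clip is closer to the anchor than the negative clip by at least the margin, while the classification term continues to push the logits toward the correct class.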

Paper Structure

This paper contains 28 sections, 4 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: The artificial dataset used during training. For the music recording shown here, $N = 3$ stems passed the sanity (SNR) check, with the harmony stem used as the "sample" and the vocals and drums as "non-sample". Note that the selection of stems that make up the sample and non-sample is randomized in practice.
  • Figure 2: A diagram showing how the proposed system was trained. Anchor, positive, and negative audio clips in the artificial dataset are created following the procedure outlined in Figure 1.
  • Figure 3: A diagram showing how the proposed system was used during evaluation. The query and candidate towers are trained using the artificial dataset as outlined in Figure 1. During our experiments, we evaluate the effect of using only one tower to encode all audio, as well as the effect of changing the hop between extracted windows.
  • Figure 4: Distribution of candidate track genres in the commercial audio dataset. Each of the 68 candidate tracks is assigned a single genre.
  • Figure 5: t-SNE plot of embedding features from a subset of commercial recordings. Recordings that contain the same sample are shown in the same color, while queries and candidates are shown using dot and cross markers, respectively. Track IDs refer to those given in vanbalen_sample_2013.
  • ...and 1 more figure
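The stem selection described in the Figure 1 caption can be sketched roughly as follows. This is an illustration under stated assumptions: the level threshold in `passing_stems` is a hypothetical stand-in for the sanity (SNR) check, whose actual definition is not given here, and the rule that both the sample and non-sample groups must be non-empty is likewise assumed.

```python
import math
import random


def rms_db(signal, eps=1e-12):
    """Mean power of a signal in decibels (eps avoids log of zero)."""
    power = sum(x * x for x in signal) / max(len(signal), 1)
    return 10.0 * math.log10(power + eps)


def passing_stems(stems, min_db=-40.0):
    """Keep stems whose level exceeds min_db.

    stems: dict mapping stem name -> list of float samples.
    The threshold is a hypothetical proxy for the sanity (SNR) check.
    """
    return {name: audio for name, audio in stems.items() if rms_db(audio) > min_db}


def split_sample_non_sample(stem_names, rng):
    """Randomly partition stems into a "sample" and a "non-sample" group.

    Assumes at least two stems so that both groups are non-empty.
    """
    names = sorted(stem_names)
    if len(names) < 2:
        raise ValueError("need at least two stems to form both groups")
    sample_size = rng.randint(1, len(names) - 1)
    sample = set(rng.sample(names, sample_size))
    non_sample = [name for name in names if name not in sample]
    return sorted(sample), non_sample
```

A call such as `split_sample_non_sample(passing_stems(stems), random.Random(0))` would then return one random partition of the stems that survived the check; in the recording shown in Figure 1, the corresponding outcome is harmony as the sample with vocals and drums as non-sample.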