Automatic Live Music Song Identification Using Multi-level Deep Sequence Similarity Learning

Aapo Hakala, Trevor Kincy, Tuomas Virtanen

TL;DR

The paper tackles automatic live music song identification: given a live recording, retrieve the studio version of the same song from a database using a similarity-learning framework. It introduces a Siamese CNN that computes cross-similarity matrices from multi-level deep sequences derived from CQ-spectrograms and uses them to measure similarity between tracks, so that identification stays robust to the variations of live performance. Three feature extraction variants are evaluated on a custom live-music dataset, with the Covers80 benchmark used to assess generalization; the best model reaches 87.4% top-1 accuracy on the live data (93.6% top-5), demonstrating the viability of deep similarity learning for live performance identification. The approach has practical implications for rights administration and metadata retrieval in live music contexts, offering a path toward automated tracking despite tempo, key, or crowd-induced variations.

Abstract

This paper studies the novel problem of automatic live music song identification, where the goal is, given a live recording of a song, to retrieve the corresponding studio version of the song from a music database. We propose a system based on similarity learning and a Siamese convolutional neural network-based model. The model uses cross-similarity matrices of multi-level deep sequences to measure musical similarity between different audio tracks. A manually collected custom live music dataset is used to test the performance of the system with live music. The results of the experiments show that the system is able to identify 87.4% of the given live music queries.
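
To make the reported identification rate concrete, the sketch below shows one common way a top-k identification accuracy can be computed from pairwise similarity scores. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `top_k_accuracy`, the score-matrix layout (one row per live query, one column per studio track) and the toy numbers are all hypothetical.

```python
import numpy as np

def top_k_accuracy(scores, true_index, k=1):
    """Fraction of queries whose correct database track is among the k best scores.

    scores:     (num_queries, num_tracks) similarity scores, higher = more similar
                (assumed layout, e.g. model outputs in [0, 1]).
    true_index: (num_queries,) column index of the correct studio track per query.
    """
    scores = np.asarray(scores, dtype=np.float64)
    true_index = np.asarray(true_index)
    # Column indices of the k highest-scoring database tracks for every query.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    # A query counts as identified if its correct track appears in that set.
    hits = (top_k == true_index[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 live queries scored against a 4-track studio database.
scores = np.array([
    [0.9, 0.2, 0.1, 0.3],   # correct track 0 is ranked first
    [0.4, 0.6, 0.8, 0.5],   # correct track 1 is ranked second
    [0.1, 0.2, 0.7, 0.6],   # correct track 2 is ranked first
])
true_index = np.array([0, 1, 2])

top1 = top_k_accuracy(scores, true_index, k=1)
top2 = top_k_accuracy(scores, true_index, k=2)
print(top1, top2)
```

In this toy setup the second query is only identified once the two best candidates are considered, which is why the top-k figure can exceed the top-1 figure.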

Paper Structure

This paper contains 10 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: A block diagram of the model architecture. The input CQ-spectrograms $X_1$ and $X_2$ of two separate tracks are fed to separate branches of an SCNN that performs the feature representation. The resulting multi-level deep sequences $A_1 \dots A_4$ and $B_1 \dots B_4$ are then used to compute the level-specific cross-similarity matrices (CSMs) $C_1 \dots C_4$. Two parallel CNNs measure similarity, followed by four fully connected layers, a single output neuron and a sigmoid activation function. The final output is a similarity score $\hat{Y} \in [0, 1]$, which expresses the predicted similarity of the two input tracks as a probability.
  • Figure 2: The inner structure of branch $A$ of the SCNN. The data flow and the parametrization of the four convolution blocks are identical in branch $B$. The layer parameter names are abbreviated as follows: c=number of channels, k=kernel size, d=dilation, s=stride and p=dropout probability.
  • Figure 3: Visualization of CSMs. The input songs used in the class 1 example are 'Losing My Religion' by R.E.M. and a live recording of the same song. In the class 0 example, the same live recording is compared against 'E5150' by Black Sabbath.
  • Figure 4: CNN structure and layer parameters. The layer parameter names are abbreviated as follows: c=number of channels, k=kernel size, d=dilation, s=stride and pd=padding.
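
The captions above describe level-specific CSMs $C_1 \dots C_4$ computed from the deep sequences $A_1 \dots A_4$ and $B_1 \dots B_4$. As a rough illustration of that step, the following sketch builds one cross-similarity matrix per level. Several things here are assumptions rather than details taken from the paper: cosine similarity as the pairwise measure, the (time, feature) layout of each sequence, the sequence shapes, and the function names `cross_similarity_matrix` and `multilevel_csms`.

```python
import numpy as np

def cross_similarity_matrix(a, b, eps=1e-8):
    """Pairwise cosine similarity between the frames of two feature sequences.

    a: (T_a, D) sequence from one track, b: (T_b, D) sequence from the other.
    Returns a (T_a, T_b) matrix whose entry (i, j) compares frame i of `a`
    with frame j of `b`. Cosine similarity is an assumed choice of measure.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Normalise every frame to unit length; eps guards against all-zero frames.
    a_unit = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_unit = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a_unit @ b_unit.T

def multilevel_csms(seqs_a, seqs_b):
    """One CSM per level: C_l is computed from the pair (A_l, B_l)."""
    return [cross_similarity_matrix(a, b) for a, b in zip(seqs_a, seqs_b)]

# Hypothetical shapes: deeper levels are shorter in time and wider in features.
rng = np.random.default_rng(0)
level_shapes = [(64, 16), (32, 32), (16, 64), (8, 128)]
A = [rng.standard_normal((t, d)) for t, d in level_shapes]
B = [rng.standard_normal((t + 4, d)) for t, d in level_shapes]

C = multilevel_csms(A, B)
for level, c in enumerate(C, start=1):
    print(f"C_{level} shape: {c.shape}")
```

Because the two tracks may differ in length, each $C_l$ is generally rectangular; in a pipeline like the one in Figure 1, these matrices would then be passed on to the similarity-measuring CNNs.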