Automatic Live Music Song Identification Using Multi-level Deep Sequence Similarity Learning
Aapo Hakala, Trevor Kincy, Tuomas Virtanen
TL;DR
The paper tackles automatic live music song identification: given a live recording, retrieve the studio version of the same song from a database. It introduces a Siamese CNN within a similarity-learning framework that measures cross-track similarity from cross-similarity matrices of multi-level deep sequences derived from constant-Q (CQ) spectrograms, which makes identification robust to the variations typical of live performance. Three feature extraction variants are evaluated on a custom live-music dataset, with the Covers80 benchmark used to assess generalization; the best model achieves about 87% top-1 and 93.6% top-5 accuracy on live data, demonstrating the viability of deep similarity learning for live performance identification. The approach has practical implications for rights administration and metadata retrieval in live music contexts, offering a path toward automated tracking despite tempo, key, or crowd-induced variations.
Abstract
This paper studies the novel problem of automatic live music song identification, where the goal is, given a live recording of a song, to retrieve the corresponding studio version of the song from a music database. We propose a system based on similarity learning and a Siamese convolutional neural network-based model. The model uses cross-similarity matrices of multi-level deep sequences to measure musical similarity between different audio tracks. A manually collected custom live music dataset is used to test the performance of the system with live music. The results of the experiments show that the system is able to identify 87.4% of the given live music queries.
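To make the central idea concrete, the following is a minimal sketch of how a cross-similarity matrix between two feature sequences could be computed, one matrix per feature level. It is an illustration only, not the authors' implementation: the function names, the use of cosine similarity, and the array shapes are assumptions, and the Siamese CNN that would consume these matrices is not shown.

```python
import numpy as np

def cross_similarity(query_seq, ref_seq, eps=1e-9):
    """Cosine cross-similarity matrix between two feature sequences.

    query_seq: array of shape (T_q, D), one feature vector per time frame
    ref_seq:   array of shape (T_r, D)
    Returns an array of shape (T_q, T_r) whose entry (i, j) is the cosine
    similarity between query frame i and reference frame j.
    Cosine similarity is an assumption; the paper may use another measure.
    """
    q = query_seq / (np.linalg.norm(query_seq, axis=1, keepdims=True) + eps)
    r = ref_seq / (np.linalg.norm(ref_seq, axis=1, keepdims=True) + eps)
    return q @ r.T

def multi_level_cross_similarity(query_levels, ref_levels):
    """One cross-similarity matrix per feature level.

    query_levels / ref_levels: lists of sequences, e.g. activations taken
    from several depths of a feature extractor (hypothetical layout).
    Levels may differ in frame count and feature dimension.
    """
    return [cross_similarity(q, r) for q, r in zip(query_levels, ref_levels)]
```

In such a setup, the list of matrices for a (live query, studio candidate) pair would be passed to a learned model that outputs a similarity score, and candidates would be ranked by that score.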
