Investigation of perceptual music similarity focusing on each instrumental part
Yuka Hashizume, Tomoki Toda
TL;DR
The paper treats perceptual music similarity as an instrument-part-specific problem, with the goal of enabling instrumental-part-based retrieval. It runs a large-scale ABX listening test (586 subjects) on tracks and stems from the test set of the slakh2100 dataset, evaluating timbre, rhythm, melody, and overall similarity for individual parts and for mixed sounds. Building on prior work, it uses a CSN-based embedding with masked subspaces to disentangle instrument-specific similarity, and it assesses how well existing timbre-focused features align with human perception. Rhythm and melody often influence perceived similarity more than timbre does. The results support instrument-specific retrieval and indicate that music systems need to model rhythmic and melodic structure to align better with human perception of similarity.
Abstract
This paper investigates perceptual similarity between music tracks with a focus on each individual instrumental part, based on a large-scale listening test, toward developing instrumental-part-based music retrieval. In the listening test, 586 subjects evaluate the perceptual similarity of audio tracks through an ABX test. We use the music tracks and their stems in the test set of the slakh2100 dataset. Perceptual similarity is evaluated from four perspectives: timbre, rhythm, melody, and overall. Our analysis of the listening test results shows that 1) perceptual music similarity varies depending on which instrumental part is focused on within each track; 2) rhythm and melody tend to have a larger impact on perceptual music similarity than timbre, except for the melody of drums; and 3) previously proposed music similarity features tend to capture mainly the perceptual similarity of timbre.
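
As a rough illustration of how an ABX judgment can be compared with a feature-based prediction, the sketch below measures distance only within a masked subspace of an embedding and reports how often the prediction matches listener choices. It is a minimal sketch under stated assumptions, not the procedure used in the paper: the 4-dimensional embeddings, the "drums" and "bass" masks, the listener responses, and all function names are hypothetical.

```python
import math

def masked_distance(x, y, mask):
    """Euclidean distance over the dimensions selected by a 0/1 mask."""
    return math.sqrt(sum(m * (xi - yi) ** 2 for xi, yi, m in zip(x, y, mask)))

def abx_prediction(emb_a, emb_b, emb_x, mask):
    """Predict which of A or B is closer to X within the masked subspace."""
    d_a = masked_distance(emb_x, emb_a, mask)
    d_b = masked_distance(emb_x, emb_b, mask)
    return "A" if d_a <= d_b else "B"

def agreement_rate(predictions, listener_choices):
    """Fraction of trials where the prediction matches the listener's choice."""
    matches = sum(p == c for p, c in zip(predictions, listener_choices))
    return matches / len(predictions)

# Hypothetical embeddings: dims 0-1 stand for "drums", dims 2-3 for "bass".
emb_a = [0.0, 0.0, 1.0, 1.0]
emb_b = [1.0, 1.0, 0.0, 0.0]
emb_x = [0.1, 0.1, 0.1, 0.1]

drums_mask = [1, 1, 0, 0]  # assumed subspace layout, for illustration only
bass_mask = [0, 0, 1, 1]

print(abx_prediction(emb_a, emb_b, emb_x, drums_mask))  # X is closer to A on drums
print(abx_prediction(emb_a, emb_b, emb_x, bass_mask))   # X is closer to B on bass

# Made-up listener responses for three trials.
print(agreement_rate(["A", "B", "A"], ["A", "B", "B"]))
```

In this toy case the same triplet yields opposite answers depending on the mask, which mirrors the finding that similarity depends on which instrumental part is in focus.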
