Investigation of perceptual music similarity focusing on each instrumental part

Yuka Hashizume, Tomoki Toda

TL;DR

The paper investigates perceptual music similarity as an instrument-part-specific problem to enable instrumental-part-based retrieval. It employs a large-scale ABX listening test (586 subjects) on stems from the slakh2100 dataset, evaluating timbre, rhythm, melody, and overall similarity for individual parts and mixed sounds. Building on prior work, it uses a CSN-based embedding with masked subspaces to disentangle instrument-specific similarity and assesses how well existing timbre-focused features align with human perception, finding rhythm and melody often exceed timbre in influence. The results support instrument-specific retrieval and highlight the need to model rhythmic and melodic structure for better alignment with human perceptual similarity in music systems.

Abstract

This paper presents an investigation of perceptual similarity between music tracks, focusing on each individual instrumental part, based on a large-scale listening test, toward developing instrumental-part-based music retrieval. In the listening test, 586 subjects evaluate the perceptual similarity of the audio tracks through an ABX test. We use the music tracks and their stems in the test set of the slakh2100 dataset. The perceptual similarity is evaluated from four perspectives: timbre, rhythm, melody, and overall. We have analyzed the results of the listening test and have found that 1) perceptual music similarity varies depending on which instrumental part is focused on within each track; 2) rhythm and melody tend to have a larger impact on perceptual music similarity than timbre, except for the melody of drums; and 3) previously proposed music similarity features tend to capture mainly the perceptual similarity of timbre.

Paper Structure

This paper contains 14 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: How to create one test set. Examples of how to make one sample set for the piano sound are also shown with a red line for test 1 and test 2, respectively. Sample sets are created in the same way for other instrumental sounds (and the mixed sound for the additional experiment), and the procedure is repeated four times.
  • Figure 2: Histograms of the number of subjects in each test set. The horizontal axis is the set index, and the vertical axis is the number of subjects who answered the corresponding set.
  • Figure 3: Heatmaps of the averages of matching rates of answers between instruments for a subject. “Dr.”, “Ba.”, “Pi.”, “Gu.”, “Ot.” and “Mi.” represent drums, bass, piano, guitar, others, and mixed sound, respectively.
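
The matching rate in the last caption can be illustrated with a minimal sketch. It assumes that a subject answers the same ABX trials once per instrumental part and that the matching rate between two parts is the fraction of trials on which the two answers agree. That definition, the function names, and the example answers are assumptions made for illustration; they are not taken from the paper.

```python
from itertools import combinations


def matching_rate(answers_a, answers_b):
    """Fraction of trials on which two answer sequences agree."""
    if len(answers_a) != len(answers_b) or not answers_a:
        raise ValueError("answer sequences must be non-empty and of equal length")
    agree = sum(a == b for a, b in zip(answers_a, answers_b))
    return agree / len(answers_a)


def pairwise_matching_rates(answers_by_part):
    """Matching rate for every unordered pair of instrumental parts."""
    return {
        (p, q): matching_rate(answers_by_part[p], answers_by_part[q])
        for p, q in combinations(answers_by_part, 2)
    }


# Hypothetical ABX answers of one subject ("A" or "B" judged closer to X).
example = {
    "Dr.": ["A", "B", "A", "A"],
    "Ba.": ["A", "B", "B", "A"],
    "Pi.": ["B", "B", "A", "A"],
}
rates = pairwise_matching_rates(example)
```

Averaging such per-subject rates over all subjects would give one cell per instrument pair, which is the kind of quantity a heatmap like the one described could display.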