Learning Multidimensional Disentangled Representations of Instrumental Sounds for Musical Similarity Assessment

Yuka Hashizume, Li Li, Atsushi Miyashita, Tomoki Toda

TL;DR

This work tackles flexible music similarity by enabling focus on individual instrumental elements within a single inference model. It introduces a Conditional Similarity Network that partitions a shared embedding into instrument-specific subspaces via masks, trained with a masked triplet loss and augmented by pseudo-mixed data and an auxiliary loss from isolated-instrument embeddings. Empirical results show improved embedding quality over separated-input baselines, with subspaces that retain instrument-specific characteristics and perceptual alignment for drums and guitar. The approach offers practical benefits for music recommendation and retrieval systems by enabling instrument-aware similarity without requiring isolated instrument signals at query time.

Abstract

To build a flexible recommendation and retrieval system, it is desirable to compute music similarity with respect to multiple partial elements of musical pieces and to let users select the element they want to focus on. A previous study proposed using multiple individual networks to compute music similarity based on each instrumental sound, but it is impractical to use each individual instrumental signal as a query in search systems. Using separated instrumental sounds instead reduced accuracy because of artifacts. In this paper, we propose a method to compute similarities focusing on each instrumental sound with a single network that takes mixed sounds as input instead of individual instrumental sounds. Specifically, we design a single similarity embedding space with disentangled dimensions for each instrument, extracted by Conditional Similarity Networks, which are trained with a triplet loss using masks. Experimental results show that (1) the proposed method obtains more accurate feature representations than individual networks that take separated sounds as input, (2) each sub-embedding space retains the characteristics of the corresponding instrument, and (3) the similar musical pieces selected by the proposed method when focusing on each instrumental sound agree with human perception, especially for drums and guitar.
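
The masked triplet loss described above can be illustrated with a minimal sketch. It is not the authors' implementation: the embedding size, the even split of dimensions across instruments (`DIMS_PER_INSTRUMENT`), the Euclidean distance, the margin value, and all function names are assumptions chosen only for illustration.

```python
import math

# Hypothetical layout: a 10-dimensional embedding split evenly across
# five instruments. The real dimensionality is not given in this summary.
INSTRUMENTS = ["drums", "bass", "piano", "guitar", "others"]
DIMS_PER_INSTRUMENT = 2  # assumed value


def make_mask(condition):
    """Return a 0/1 mask that keeps only the dimensions assigned to `condition`."""
    mask = []
    for name in INSTRUMENTS:
        mask.extend([1.0 if name == condition else 0.0] * DIMS_PER_INSTRUMENT)
    return mask


def masked_distance(u, v, mask):
    """Euclidean distance restricted to the unmasked dimensions."""
    return math.sqrt(sum((m * (a - b)) ** 2 for a, b, m in zip(u, v, mask)))


def masked_triplet_loss(anchor, positive, negative, condition, margin=0.2):
    """Hinge-style triplet loss evaluated only in the subspace of `condition`."""
    mask = make_mask(condition)
    d_pos = masked_distance(anchor, positive, mask)
    d_neg = masked_distance(anchor, negative, mask)
    return max(0.0, d_pos - d_neg + margin)


# Toy embeddings: the positive matches the anchor in the guitar dimensions
# only, while the negative differs from the anchor in the guitar dimensions only.
anchor = [0.0] * 10
positive = [5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 0.0, 0.0, 5.0, 5.0]
negative = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 4.0, 0.0, 0.0]

loss_guitar = masked_triplet_loss(anchor, positive, negative, "guitar")
loss_drums = masked_triplet_loss(anchor, positive, negative, "drums")
```

With the guitar mask the triplet is already satisfied, so the loss is zero; with the drums mask the same three embeddings violate the margin. This is the sense in which one embedding can yield different similarity judgments depending on the selected instrument.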

Paper Structure (26 sections, 6 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Overview of the proposed method. $x^{(a)}$, $x^{(p)}$, and $x^{(n)}$ denote the anchor, positive, and negative samples, respectively. ‘Dr.,' ‘Ba.,' ‘Pi.,' ‘Gu.,' and ‘Ot.' denote drums, bass, piano, guitar, and others, respectively. $\boldsymbol{A}$, $\boldsymbol{B}$, and $\boldsymbol{C}$ indicate the ID of the musical piece from which each instrumental sound originates. This figure shows an example in which the condition is set to guitar: an anchor sample "a" and a positive sample "b" are extracted from two pseudo-mixed pieces, $A^{(gu)}_B$ and $A^{(gu)}_C$, respectively, which contain different segments of the guitar sound of the same piece A. The network extracts an embedding from each sample, and the embedding is masked so that only the dimensions assigned to guitar contribute to the triplet loss calculation.
  • Figure 2: Interchanged triplet used in addition to the basic triplet. The negative sample "c" in Fig. 1 is used as a positive sample, and the positive sample "b" in Fig. 1 is used as a negative sample, by setting the condition to an instrument other than guitar, e.g., bass.
  • Figure 3: Generation of target embeddings for the auxiliary loss calculation. The upper figure shows target embedding generation for the mixed sounds. The lower figure shows target embedding generation for each individual instrumental sound, which is used for pretraining only.
  • Figure 4: Network architecture. ‘c,' ‘k,' and ‘s' denote the number of channels, kernel size, and stride, respectively. “Conv” and “FC” denote convolutional and fully connected layers, respectively. “BN” denotes batch normalization. “Mean(t)” denotes averaging along the time axis. The numbers above the input, the output, and “FC” are their sizes.
  • Figure 5: The drums' subspace
  • ...and 4 more figures
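
The roles described in the captions of Figures 1 and 2 can be sketched by tracking only which piece each instrument comes from. This is an illustrative reading, not the authors' code: `pseudo_mix` and `assign_roles` are hypothetical helpers, real samples are audio segments rather than ID tables, and the composition of sample "c" (guitar from piece C, all other instruments from piece B) is an assumption, since the captions do not spell it out.

```python
INSTRUMENTS = ("drums", "bass", "piano", "guitar", "others")


def pseudo_mix(condition, condition_piece, other_piece):
    """Describe a pseudo-mixed sample as a map from instrument to source piece ID.

    The conditioned instrument comes from `condition_piece`; every other
    instrument comes from `other_piece`.
    """
    return {
        inst: (condition_piece if inst == condition else other_piece)
        for inst in INSTRUMENTS
    }


def assign_roles(anchor, cand_1, cand_2, condition):
    """Return (positive, negative) under `condition`, or None if no valid triplet.

    The positive shares the source piece of the conditioned instrument with
    the anchor; the negative does not.
    """
    match_1 = cand_1[condition] == anchor[condition]
    match_2 = cand_2[condition] == anchor[condition]
    if match_1 and not match_2:
        return cand_1, cand_2
    if match_2 and not match_1:
        return cand_2, cand_1
    return None


a = pseudo_mix("guitar", "A", "B")  # anchor: guitar from A, rest from B
b = pseudo_mix("guitar", "A", "C")  # guitar from A, rest from C
c = pseudo_mix("guitar", "C", "B")  # assumed: guitar from C, rest from B

basic = assign_roles(a, b, c, "guitar")       # basic triplet (Fig. 1)
interchanged = assign_roles(a, b, c, "bass")  # interchanged triplet (Fig. 2)
```

Under the guitar condition "b" is the positive and "c" the negative; switching the condition to bass swaps the two roles, which matches the interchange described for Figure 2 under the stated assumption about "c".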