Learning Multidimensional Disentangled Representations of Instrumental Sounds for Musical Similarity Assessment
Yuka Hashizume, Li Li, Atsushi Miyashita, Tomoki Toda
TL;DR
This work tackles flexible music similarity by enabling focus on individual instrumental elements within a single inference model. It introduces a Conditional Similarity Network that partitions a shared embedding into instrument-specific subspaces via masks, trained with a masked triplet loss and augmented by pseudo-mixed data and an auxiliary loss from isolated-instrument embeddings. Empirical results show improved embedding quality over separated-input baselines, with subspaces that retain instrument-specific characteristics and perceptual alignment for drums and guitar. The approach offers practical benefits for music recommendation and retrieval systems by enabling instrument-aware similarity without requiring isolated instrument signals at query time.
Abstract
To build a flexible recommendation and retrieval system, it is desirable to compute music similarity with respect to multiple partial elements of a musical piece and to let users select the element they want to focus on. A previous study proposed using a separate network for each instrument to compute music similarity from each instrumental sound, but supplying individual instrument signals as queries is impractical in search systems. Using separated instrumental sounds as a substitute reduced accuracy because of separation artifacts. In this paper, we propose a method that computes similarity focusing on each instrumental sound with a single network that takes mixed sounds as input instead of individual instrumental sounds. Specifically, we design a single similarity embedding space with disentangled dimensions for each instrument, extracted by Conditional Similarity Networks, which are trained with a triplet loss using masks. Experimental results show that (1) the proposed method obtains more accurate feature representations than individual networks that take separated sounds as input, (2) each sub-embedding space retains the characteristics of the corresponding instrument, and (3) the similar musical pieces selected by the proposed method when focusing on each instrumental sound agree with human judgments, especially for drums and guitar.
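The masked triplet loss mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The fixed, equal-sized, non-overlapping binary masks, the Euclidean distance, the margin value, and all function names are assumptions chosen for illustration; the embeddings would normally come from the network applied to mixed audio.

```python
import numpy as np


def instrument_mask(index, n_instruments, dims_per_instrument):
    """Binary mask selecting the block of dimensions assigned to one instrument.

    Assumption: the embedding is split into equal, non-overlapping blocks.
    """
    mask = np.zeros(n_instruments * dims_per_instrument)
    start = index * dims_per_instrument
    mask[start:start + dims_per_instrument] = 1.0
    return mask


def masked_distance(a, b, mask):
    """Euclidean distance measured only inside the masked subspace."""
    return float(np.linalg.norm((a - b) * mask))


def masked_triplet_loss(anchor, positive, negative, mask, margin=0.2):
    """Hinge-style triplet loss restricted to one instrument's subspace.

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive, within the masked dimensions.
    The margin of 0.2 is a hypothetical value.
    """
    d_pos = masked_distance(anchor, positive, mask)
    d_neg = masked_distance(anchor, negative, mask)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    # Hypothetical 6-dim embeddings: 3 instruments x 2 dims each.
    anchor = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])
    positive = np.array([0.0, 0.0, 1.0, 1.0, 5.0, 5.0])
    negative = np.array([3.0, 4.0, 1.0, 1.0, 0.0, 0.0])

    for i in range(3):
        mask = instrument_mask(i, n_instruments=3, dims_per_instrument=2)
        print(i, masked_triplet_loss(anchor, positive, negative, mask))
```

Because each mask zeroes out every other instrument's dimensions, the same triplet can be satisfied in one subspace and violated in another. At query time the same masked distance could be used to rank pieces by a user-selected instrument without needing isolated instrument signals.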
