Music Similarity Representation Learning Focusing on Individual Instruments with Source Separation and Human Preference

Takehiro Imamura, Yuka Hashizume, Wen-Chin Huang, Tomoki Toda

TL;DR

This work targets music similarity representation learning at the instrument level (InMSRL) by leveraging music source separation and human preference signals. It introduces Cascade-FT to end-to-end fine-tune MSS and instrument-specific extractors, Direct-Reconst to jointly learn disentangled features with reconstruction losses, and PAFT to align representations with human perceptual similarity using ABX data. Empirical results on the Slakh dataset show Cascade-FT with PAFT achieves the best perceptual and cross-piece InMSRL performance, while Direct-Reconst benefits from multi-task learning and data augmentation. The findings indicate that joint optimization and perceptual supervision help overcome MSS errors and improve instrument-specific similarity representations for retrieval and recommendation tasks.

Abstract

This paper proposes music similarity representation learning (MSRL) based on individual instrument sounds (InMSRL), utilizing music source separation (MSS) and human preference without requiring clean instrument sounds during inference. We propose three methods that effectively improve performance. First, we introduce end-to-end fine-tuning (E2E-FT) for the Cascade approach, which sequentially performs MSS and music similarity feature extraction. E2E-FT allows the model to minimize the adverse effects of separation errors on feature extraction. Second, we propose multi-task learning for the Direct approach, which directly extracts disentangled music similarity features using a single music similarity feature extractor. Multi-task learning, which combines disentangled music similarity feature extraction with MSS based on reconstruction from those disentangled features, further enhances instrument feature disentanglement. Third, we employ perception-aware fine-tuning (PAFT). PAFT utilizes human preference, allowing the model to perform InMSRL aligned with human perceptual similarity. We conduct experimental evaluations and demonstrate that 1) E2E-FT for Cascade significantly improves InMSRL performance, 2) multi-task learning for Direct also helps improve disentanglement performance in feature extraction, 3) PAFT significantly enhances perceptual InMSRL performance, and 4) Cascade with E2E-FT and PAFT outperforms Direct with multi-task learning and PAFT.
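To make the idea of alignment with human perceptual similarity concrete, the following is a minimal sketch of how agreement between a feature space and ABX judgments could be measured. It assumes each ABX trial records which of two candidates, A or B, a listener judged more similar to a reference X. The function names, the toy vectors, and the use of cosine distance are illustrative assumptions, not the paper's evaluation protocol or training objective.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero feature vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def abx_agreement(trials):
    """Fraction of ABX trials where the feature space agrees with the listener.

    Each trial is (x, a, b, human_choice), where x, a, b are feature vectors
    and human_choice is "A" or "B" (hypothetical data layout).
    """
    if not trials:
        raise ValueError("need at least one ABX trial")
    hits = 0
    for x, a, b, human_choice in trials:
        # The model "chooses" whichever candidate lies closer to X.
        model_choice = "A" if cosine_distance(x, a) < cosine_distance(x, b) else "B"
        hits += model_choice == human_choice
    return hits / len(trials)


# Toy, hand-made trials (not real listening-test data).
toy_trials = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0], "A"),
    ([0.0, 1.0], [1.0, 0.0], [0.1, 0.9], "B"),
    ([1.0, 1.0], [1.0, 0.0], [1.0, 0.9], "A"),
]
print(abx_agreement(toy_trials))
```

A higher agreement rate would suggest that distances in the feature space track listener judgments more closely, which is the kind of property PAFT is described as improving.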

Paper Structure

This paper contains 30 sections, 1 equation, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of pseudo-musical-pieces. Instruments with the same color and the same ID indicate sample segments extracted from the same musical piece. This figure illustrates an example of the pseudo-musical-pieces created for training that focuses on drums.
  • Figure 2: Overview of Cascade-FT model.
  • Figure 3: Overview of Direct-Reconst model. Network inputs and outputs shown in the same color indicate segments extracted from the same musical piece.
  • Figure 4: Difference between MES-Normal and MES-Pseudo. The top part of the figure shows MES-Normal, and the bottom part shows MES-Pseudo. This is an example of evaluation for drums. Instruments with the same color and the same ID indicate segments extracted from the same musical piece.
  • Figure 5: Visualization results of the music similarity features for pseudo-musical-pieces. In the visualization, the musical piece of the target instrument is represented by color, while that of the non-target instruments is represented by shape. In this setting, clustering of music similarity features with the same color but different shapes indicates that the model focuses only on the features of the target instrument. In contrast, clustering of features with the same shape but different colors indicates that the model focuses on the features of the non-target instruments, while clustering of features with the same shape and color indicates that the model focuses on the features of the overall musical piece. The music similarity feature vectors were compressed to two-dimensional vectors by t-SNE.
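
The reading of Figure 5 can also be expressed numerically. The sketch below is an illustrative assumption, not part of the paper: it compares the mean distance between features that share a target-instrument piece ID (same color, different shape) against the mean distance between features that share a non-target piece ID (same shape, different color). The field names `target_id`, `non_target_id`, and `feat`, and the toy vectors, are hypothetical.

```python
import math
from itertools import combinations


def mean_pair_distance(points, pair_filter):
    """Mean Euclidean distance over all pairs accepted by pair_filter."""
    dists = [
        math.dist(p["feat"], q["feat"])
        for p, q in combinations(points, 2)
        if pair_filter(p, q)
    ]
    if not dists:
        raise ValueError("no pairs matched the filter")
    return sum(dists) / len(dists)


def same_color_only(p, q):
    # Same target-instrument piece, different non-target pieces.
    return p["target_id"] == q["target_id"] and p["non_target_id"] != q["non_target_id"]


def same_shape_only(p, q):
    # Same non-target pieces, different target-instrument piece.
    return p["non_target_id"] == q["non_target_id"] and p["target_id"] != q["target_id"]


# Toy features that cluster by target ID (hand-made, not model outputs).
toy_points = [
    {"target_id": 0, "non_target_id": 0, "feat": [0.0, 0.0]},
    {"target_id": 0, "non_target_id": 1, "feat": [0.1, 0.0]},
    {"target_id": 1, "non_target_id": 0, "feat": [1.0, 1.0]},
    {"target_id": 1, "non_target_id": 1, "feat": [1.0, 1.1]},
]

d_color = mean_pair_distance(toy_points, same_color_only)
d_shape = mean_pair_distance(toy_points, same_shape_only)
# d_color < d_shape suggests the features follow the target instrument.
print(d_color < d_shape)
```

Under these assumptions, a smaller same-color distance corresponds to the first case in the caption (focus on the target instrument), and a smaller same-shape distance corresponds to the second (focus on non-target instruments).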