MAJL: A Model-Agnostic Joint Learning Framework for Music Source Separation and Pitch Estimation

Haojie Wei; Jun Yuan; Rui Zhang; Quanyu Dai; Yueguo Chen

TL;DR

MAJL is a model-agnostic joint learning framework for music source separation (MSS) and pitch estimation (PE). It exploits large single-labeled datasets through a two-stage training approach and mitigates optimization conflicts between the two tasks through Dynamic Weights on Hard Samples (DWHS). The two-stage method generates high-quality pseudo labels to expand the labeled data, while DWHS adaptively weights hard or noisy samples to align the MSS and PE objectives and limit error propagation. On MIR-1K, MedleyDB, MUSDB18, and MIR_ST500, MAJL achieves state-of-the-art performance on both tasks, with reported gains of 0.92 in SDR and 2.71% in RPA, and it generalizes across model architectures. Because the framework is model-agnostic, it can be combined with existing MSS and PE models, which makes it practical for transcription and source separation in real-world, data-scarce scenarios.
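The pseudo-labeling idea behind the two-stage method can be illustrated with a small sketch. This is not the authors' implementation: the `Sample` class, the `fill_pseudo_labels` helper, and the stand-in `separate` and `estimate_pitch` callables are all hypothetical names, and the summary does not say how pseudo labels are filtered for quality. The sketch only assumes that models from a first training stage fill in whichever label a single-labeled sample is missing, so that a second stage can train on the combined data.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Sample:
    """One training item; either label may be missing (single-labeled data)."""
    mixture: List[float]
    vocals: Optional[List[float]] = None  # MSS label (hypothetical field)
    pitch: Optional[List[float]] = None   # PE label (hypothetical field)
    pseudo: bool = False                  # True if any label was generated


def fill_pseudo_labels(
    samples: List[Sample],
    separate: Callable[[List[float]], List[float]],
    estimate_pitch: Callable[[List[float]], List[float]],
) -> List[Sample]:
    """Fill missing labels using models assumed to come from a first stage."""
    completed = []
    for s in samples:
        vocals, pitch, pseudo = s.vocals, s.pitch, False
        if vocals is None:
            # PE-only sample: generate a pseudo separation target.
            vocals = separate(s.mixture)
            pseudo = True
        if pitch is None:
            # MSS-only sample: estimate pitch from the (true) separated source.
            pitch = estimate_pitch(vocals)
            pseudo = True
        completed.append(Sample(s.mixture, vocals, pitch, pseudo))
    return completed


# Toy stand-ins for stage-one models (assumptions, not real MSS/PE models).
def toy_separate(mixture: List[float]) -> List[float]:
    return [0.5 * x for x in mixture]


def toy_pitch(vocals: List[float]) -> List[float]:
    return [220.0 for _ in vocals]


data = [
    Sample([1.0, 2.0], vocals=[0.4, 0.9], pitch=[220.0, 230.0]),  # fully labeled
    Sample([1.0, 2.0], vocals=[0.4, 0.9]),                        # MSS label only
    Sample([1.0, 2.0], pitch=[200.0, 210.0]),                     # PE label only
]
training_set = fill_pseudo_labels(data, toy_separate, toy_pitch)
```

Keeping a `pseudo` flag on each sample is a design choice of this sketch: it lets a later stage treat generated labels differently from ground-truth ones, for example by weighting them.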

Abstract

Music source separation and pitch estimation are two vital tasks in music information retrieval. Typically, the input of pitch estimation is obtained from the output of music source separation. Existing methods have therefore tried to perform the two tasks simultaneously, so as to leverage the mutually beneficial relationship between them. However, these methods still face two critical challenges that limit the improvement of both tasks: the lack of labeled data and the difficulty of joint learning optimization. To address these challenges, we propose a Model-Agnostic Joint Learning (MAJL) framework for both tasks. MAJL is a generic framework and can use different models for each task. It includes a two-stage training method and a dynamic weighting method named Dynamic Weights on Hard Samples (DWHS), which address the lack of labeled data and joint learning optimization, respectively. Experimental results on public music datasets show that MAJL outperforms state-of-the-art methods on both tasks, with significant improvements of 0.92 in Signal-to-Distortion Ratio (SDR) for music source separation and 2.71% in Raw Pitch Accuracy (RPA) for pitch estimation. Furthermore, comprehensive studies not only validate the effectiveness of each component of MAJL, but also indicate the generality of MAJL in adapting to different model architectures.
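For readers unfamiliar with the two metrics quoted above, the following sketch shows their common basic definitions. It is a simplification and not necessarily the evaluation code used in the paper: published SDR figures are often computed with BSS Eval-style toolkits, and the 50-cent tolerance and the treatment of unvoiced estimates as misses are assumptions of this sketch. The function names `sdr_db` and `raw_pitch_accuracy` are illustrative.

```python
import math
from typing import Sequence


def sdr_db(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Basic SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if noise == 0:
        return math.inf   # perfect reconstruction
    if signal == 0:
        return -math.inf  # silent reference, non-silent error
    return 10.0 * math.log10(signal / noise)


def raw_pitch_accuracy(
    ref_hz: Sequence[float],
    est_hz: Sequence[float],
    tolerance_cents: float = 50.0,
) -> float:
    """Fraction of voiced reference frames whose estimate is within tolerance.

    A frame is treated as voiced when its reference frequency is > 0.
    An estimate of 0 Hz or less on a voiced frame counts as a miss
    (an assumption of this sketch).
    """
    voiced = [(r, e) for r, e in zip(ref_hz, est_hz) if r > 0]
    if not voiced:
        return 0.0
    hits = 0
    for r, e in voiced:
        if e > 0 and abs(1200.0 * math.log2(e / r)) <= tolerance_cents:
            hits += 1
    return hits / len(voiced)


# Toy frames: one close estimate, one a semitone off, one unvoiced, one missed.
reference_f0 = [440.0, 440.0, 0.0, 220.0]
estimated_f0 = [442.0, 466.16, 100.0, 0.0]
rpa = raw_pitch_accuracy(reference_f0, estimated_f0)
sdr = sdr_db([1.0, 1.0], [0.9, 0.9])
```

Under these definitions, a higher SDR means less residual error in the separated signal, and a higher RPA means more voiced frames with a pitch estimate within half a semitone of the reference.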

Paper Structure (35 sections, 20 equations, 7 figures, 9 tables)

Introduction
Related Work
Problem Formulation
Method
Two-Stage Training Method
Dynamic Weights on Hard Samples (DWHS)
Analysis of Different Cases in DWHS
Module of DWHS
Loss of DWHS
Experimental Setup
Experimental Results
Overall Performance
Results on Fully-labeled Dataset
Results on Single-labeled Datasets
Experiments With Different Modules
...and 20 more sections

Figures (7)

Figure 1: Existing methods for MSS and PE.
Figure 2: The overall structure of the Model-Agnostic Joint Learning (MAJL) framework. Details of JCF and DWHS are shown in Figure 3. The synthetic data contains fully-labeled data and single-labeled data with generated pseudo labels.
Figure 3: Details of JCF and DWHS. The MSS Module and the PE Module use existing MSS and PE models, respectively. The details of the Dynamic Weight Module (DWM) are shown in Figure 4.
Figure 4: The model structure of DWM in DWHS.
Figure 5: Dynamic weights extracted by the DWHS.
...and 2 more figures
