JEPOO: Highly Accurate Joint Estimation of Pitch, Onset and Offset for Music Information Retrieval

Haojie Wei, Jun Yuan, Rui Zhang, Yueguo Chen, Gang Wang

TL;DR

The paper tackles melody extraction by jointly estimating pitch, onset, and offset across single-pitch (SP), multi-pitch (MP), and mixed data. It introduces JEPOO, a model that combines shared representations, task-specific predictors, and a fusion mechanism that uses onset and offset cues to improve pitch prediction. A key contribution is Pareto Modulated Loss with Loss Weight Regularization (PML with LWR), which unifies Pareto-based task weighting with focal-like sample weighting while keeping training cost manageable. Experiments show that JEPOO outperforms state-of-the-art methods by up to 10.6%, 8.3%, and 10.3% on pitch, onset, and offset prediction, respectively, and that it remains robust across datasets, instruments, and SP/MP data mixtures.

Abstract

Melody extraction is a core task in music information retrieval, and the estimation of pitch, onset and offset are key sub-tasks in melody extraction. Existing methods have limited accuracy, and work for only one type of data, either single-pitch or multi-pitch. In this paper, we propose a highly accurate method for joint estimation of pitch, onset and offset, named JEPOO. We address the challenges of joint learning optimization and handling both single-pitch and multi-pitch data through novel model design and a new optimization technique named Pareto modulated loss with loss weight regularization. This is the first method that can accurately handle both single-pitch and multi-pitch music data, and even a mix of them. A comprehensive experimental study on a wide range of real datasets shows that JEPOO outperforms state-of-the-art methods by up to 10.6%, 8.3% and 10.3% for the prediction of Pitch, Onset and Offset, respectively, and JEPOO is robust for various types of data and instruments. The ablation study shows the effectiveness of each component of JEPOO.

Paper Structure

This paper contains 21 sections, 4 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Melody extraction.
  • Figure 2: The overall structure of JEPOO.
  • Figure 3: The details of residual convolution (ReConv) block.
  • Figure 4: Performance on synthetic test datasets with different proportions of multi-pitch data. The results of CREPE, MT3 and OAF are reproduced using the authors' open-source checkpoints. OAF-retrain denotes OAF retrained on the synthetic training dataset.
  • Figure 5: The performance of different models with different instruments. OAF-retrain represents retraining OAF.
  • ...and 2 more figures