Interactive singing melody extraction based on active adaptation

Kavya Ranjan Saxena, Vipul Arora

TL;DR

The paper tackles domain shift in singing melody extraction with an interactive, model-agnostic adaptation framework that combines confidence-guided active learning with meta-learning. It introduces a confidence model based on normalized true class probability (TCP-n) and a meta-weighting strategy that addresses severe class imbalance, enabling rapid adaptation to a target domain from very few annotations. The method, termed active-meta-learning (w-AML), outperforms non-adaptive baselines and standard meta-learning with active adaptation on multiple target datasets, and it can be applied to other melody extraction models. The authors also release the HAR dataset to support future research on Hindustani singing melody extraction and cross-domain adaptation.

Abstract

Extraction of predominant pitch from polyphonic audio is one of the fundamental tasks in the field of music information retrieval and computational musicology. To accomplish this task using machine learning, a large amount of labeled audio data is required to train the model. However, a classical model pre-trained on data from one domain (source), e.g., songs of a particular singer or genre, may not perform comparatively well in extracting melody from other domains (target). The performance of such models can be boosted by adapting the model using very little annotated data from the target domain. In this work, we propose an efficient interactive melody adaptation method. Our method selects the regions in the target audio that require human annotation using a confidence criterion based on normalized true class probability. The annotations are used by the model to adapt itself to the target domain using meta-learning. Our method also provides a novel meta-learning approach that handles class imbalance, i.e., a few representative samples from a few classes are available for adaptation in the target domain. Experimental results show that the proposed method outperforms other adaptive melody extraction baselines. The proposed method is model-agnostic and hence can be applied to other non-adaptive melody extraction models to boost their performance. Also, we released a Hindustani Alankaar and Raga (HAR) dataset containing 523 audio files of about 6.86 hours of duration intended for singing melody extraction tasks.
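
The selection step described in the abstract can be illustrated with a minimal sketch. It assumes the confidence model has already produced one score per time frame and that the annotation budget is a fixed number of frames `k`. The function name `select_frames_for_annotation` and the example scores are hypothetical and are not taken from the paper, which selects regions using its own criterion.

```python
from typing import List, Sequence


def select_frames_for_annotation(confidences: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k least-confident frames, in time order.

    Hypothetical helper: the per-frame budget k and the ranking rule are
    assumptions made for illustration only.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    # Rank frame indices by ascending confidence; sorted() is stable,
    # so ties keep the earlier frame first.
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    # Restore time order so an annotator can work through the audio linearly.
    return sorted(ranked[:k])


# Hypothetical per-frame confidence scores for a short excerpt.
example_confidences = [0.95, 0.20, 0.88, 0.35, 0.99, 0.10]
print(select_frames_for_annotation(example_confidences, k=3))
```

In this sketch only the selected frames would be sent for human annotation; the remaining frames keep the model's own predictions.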

Paper Structure (19 sections, 11 equations, 5 figures, 4 tables, 2 algorithms)

Figures (5)

  • Figure 1: Class imbalance in (a) the source domain (MIR1K) and the target domains (b) ADC2004, (c) MIREX05, and (d) HAR. Class 0 represents the non-voiced class, and classes 1-506 represent voiced pitch classes ranging from A1 (55 Hz) to B6 (1975.7 Hz). Samples of the non-voiced class are not shown because they are highly disproportionate in comparison to the voiced classes.
  • Figure 2: Different confidence criteria derived from the output of the base model $f_{[\phi,\theta]}$. In maximum class probability (a), the correct and incorrect predictions overlap considerably. In true class probability (b), the overlap is very small, and correct and incorrect predictions are well separated. Normalized true class probability (c) serves as the ground truth for training the confidence model $f_{\psi}$: correct predictions are assigned a value of 1, and incorrect predictions lie in the range [0,1). In (d), we show the output of the confidence model $f_{\psi}$ when trained with (c) as the confidence criterion.
  • Figure 3: Here, $\phi$ and $\theta$ represent the pre-trained feature extractor (F.E.) layers and the classifier layer, respectively. $\psi$ represents the parameters of the confidence model. $L_{conf}$ is calculated at a particular time frame $m=5$. Similarly, the confidence loss is calculated at every time frame, and then the confidence model is trained.
  • Figure 4: Active-meta-learning framework for polyphonic melody adaptation. In active-meta-training, for an episode $b$, ILO is performed on $T_b^S$ such that the model parameters $\theta^b$ and $\psi^b$ are updated. OLO is then performed on $T_b^Q$ to update the parameters $\theta$ and $\psi$. The same procedure is repeated for all episodes $b$ in the source domain. In active-meta-testing, for an episode $b'$, the episode parameters are initialized as $\theta^{b'}=\theta$ and $\psi^{b'}=\psi$ and are used to adapt on $T_{b'}^S$ (a single iteration of ILO) and predict on $T_{b'}^Q$.
  • Figure 5: RPA on the query set of size $(M-sK)$ vs $s$ for a typical episode from the three target datasets. Here, $s=0$ means no adaptation is performed.
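
The three confidence criteria compared in Figure 2 can be sketched for a single frame as follows. The normalization used here, true class probability divided by maximum class probability, is an assumption chosen because it reproduces the behaviour stated in the caption (exactly 1 for correct predictions, a value in [0,1) for incorrect ones); the paper's own formula may differ. The function name `confidence_criteria` is hypothetical.

```python
from typing import Sequence, Tuple


def confidence_criteria(probs: Sequence[float], true_class: int) -> Tuple[float, float, float]:
    """Return (MCP, TCP, normalized TCP) for one frame's class posteriors.

    Assumption: normalized TCP = TCP / MCP. This matches the caption's
    description but is not confirmed as the paper's exact definition.
    """
    mcp = max(probs)         # maximum class probability
    tcp = probs[true_class]  # probability assigned to the ground-truth class
    tcp_n = tcp / mcp        # 1.0 if the true class has the highest probability
    return mcp, tcp, tcp_n


# Hypothetical posteriors over three classes for one frame.
frame_probs = [0.25, 0.5, 0.25]
print(confidence_criteria(frame_probs, true_class=1))  # prediction is correct
print(confidence_criteria(frame_probs, true_class=0))  # prediction is incorrect
```

Because the true class is only known for labeled data, a value like `tcp_n` can serve as a training target for a separate confidence model, which then estimates it on unlabeled target audio.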