Toward Fully Self-Supervised Multi-Pitch Estimation

Frank Cwitkowitz, Zhiyao Duan

TL;DR

This work tackles multi-pitch estimation (MPE) under data scarcity by introducing a fully self-supervised framework (SS-MPE) that learns from synthetic monophonic notes to detect all $F_0$ activity in polyphonic mixtures. It uses harmonic constant-Q transform (HCQT) features and a convolutional autoencoder, guided by three objective families (energy concentration, timbre invariance, and geometric equivariance), to produce multi-pitch salience-grams $\hat{Y}$ without any labeled data. Empirically, SS-MPE approaches supervised MPE performance on multiple datasets, demonstrating strong generalization from monophonic NSynth training to complex polyphonic audio and highlighting the potential of self-supervised learning (SSL) for scalable music information retrieval (MIR). The approach reduces reliance on annotated polyphonic data and opens pathways to scale MPE across instruments and real-world audio, with open-source resources facilitating reproducibility and further development.
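For readers unfamiliar with HCQT features, the following minimal sketch shows the bin arithmetic behind a harmonic constant-Q transform. Each channel is a CQT whose minimum frequency is scaled by a harmonic multiple h, so the h-th partial of a note lands at the same bin index in channel h as the fundamental does in the h = 1 channel. The function name and all parameter values (`fmin`, `n_bins`, `bins_per_octave`, the harmonic set) are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def hcqt_center_frequencies(fmin, n_bins, bins_per_octave, harmonics):
    """Center frequencies in Hz per HCQT channel, shape (len(harmonics), n_bins).

    Hypothetical helper: channel h is a CQT grid starting at h * fmin.
    """
    k = np.arange(n_bins)
    base = fmin * 2.0 ** (k / bins_per_octave)  # geometric grid for h = 1
    return np.asarray(harmonics, dtype=float)[:, None] * base[None, :]

# Illustrative values only (assumed, not taken from the paper).
harmonics = [0.5, 1, 2, 3]
freqs = hcqt_center_frequencies(fmin=32.7, n_bins=72,
                                bins_per_octave=12, harmonics=harmonics)

# Bin k of the h = 2 channel sits at twice the frequency of bin k in the
# h = 1 channel, which is the same frequency as bin k + 12 of the h = 1 channel.
octave_aligned = np.allclose(freqs[2, :60], freqs[1, 12:])
```

Stacking channels this way lets a small convolutional kernel see a fundamental and its harmonics at one frequency index, which is why the representation suits pitch salience models.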

Abstract

Multi-pitch estimation is a decades-long research problem involving the detection of pitch activity associated with concurrent musical events within multi-instrument mixtures. Supervised learning techniques have demonstrated solid performance on narrower characterizations of the task, but suffer from limitations concerning the shortage of large-scale and diverse polyphonic music datasets with multi-pitch annotations. We present a suite of self-supervised learning objectives for multi-pitch estimation, which encourage the concentration of support around harmonics, invariance to timbral transformations, and equivariance to geometric transformations. These objectives are sufficient to train an entirely convolutional autoencoder to produce multi-pitch salience-grams directly, without any fine-tuning. Despite training exclusively on a collection of synthetic single-note audio samples, our fully self-supervised framework generalizes to polyphonic music mixtures and achieves performance comparable to supervised models trained on conventional multi-pitch datasets.
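The distinction between invariance and equivariance can be made concrete with a toy sketch. An invariance objective asks the output to stay the same when the input is transformed, while an equivariance objective asks the output to transform in the same way as the input. Everything below is an assumption for illustration: `toy_model` is an elementwise stand-in for the autoencoder, a frequency shift stands in for the geometric transformations, a global gain stands in for a timbral transformation, and mean squared error stands in for whatever loss the paper actually uses.

```python
import numpy as np

def toy_model(spec):
    """Stand-in for the autoencoder: squashes a (freq, time) array into [0, 1)."""
    return spec / (1.0 + spec)

def shift_frequency(x, bins):
    """Shift a (freq, time) array along frequency, zero-filling vacated rows."""
    out = np.zeros_like(x)
    if bins > 0:
        out[bins:] = x[:-bins]
    elif bins < 0:
        out[:bins] = x[-bins:]
    else:
        out[:] = x
    return out

def invariance_loss(model, spec, transform):
    """Penalize any change in the output when the input is transformed."""
    return float(np.mean((model(transform(spec)) - model(spec)) ** 2))

def equivariance_loss(model, spec, transform):
    """Penalize mismatch between model(T(x)) and T(model(x))."""
    return float(np.mean((model(transform(spec)) - transform(model(spec))) ** 2))

rng = np.random.default_rng(0)
spec = rng.random((48, 10))  # hypothetical non-negative spectrogram

# An elementwise model commutes with a frequency shift, so this loss vanishes.
eq_loss = equivariance_loss(toy_model, spec, lambda s: shift_frequency(s, 3))

# The same model is not invariant to a gain change, so this loss is positive.
inv_loss = invariance_loss(toy_model, spec, lambda s: 2.0 * s)
```

In a training setting both terms would be minimized over the model parameters; here the model is fixed, so the sketch only shows what each term measures.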

Paper Structure

This paper contains 17 sections, 8 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Example note signal (left) and its corresponding power spectrogram (right) from the NSynth dataset [engel2017neural], which is used to train the proposed self-supervised framework.
  • Figure 2: HCQT power spectrogram (in dB) for an audio signal (top-left), the corresponding first-harmonic channel (bottom-left) and weighted harmonic average (top-right), and an example multi-pitch salience-gram estimate (bottom-right). Self-supervised objectives ($\mathcal{L}_{har}$ and $\mathcal{L}_{sup}$) are employed to encourage the model to concentrate energy around strong $F_0$ candidates.
  • Figure 3: Examples (by row) of sampled Gaussian equalization curves (center) applied to CQT spectrograms (left) to produce equalized spectrograms (right) for the timbre-invariance objective.
  • Figure 4: Examples (by row) of sampled geometric transformations (center) applied to CQT spectrograms (left) to produce transformed spectrograms (right) for the geometric-equivariance objective.
  • Figure 5: CQT spectrogram $X_1$ for track 01-AchGottundHerr of Bach10 [duan2010multiple] along with the multi-pitch salience-gram output generated using our full self-supervised framework (SS-MPE) and the output for the framework with training objective ablations.
  • ...and 6 more figures
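
Two operations named in the captions for Figures 2 and 3, the weighted harmonic average and the Gaussian equalization curve, can be sketched as follows. This is a rough illustration under stated assumptions: the curve is a multiplicative Gaussian bump applied to a linear-magnitude spectrogram, and the parameters `center`, `sigma`, `gain`, and the harmonic weights are hypothetical. The paper's actual parameterization and sampling scheme are not given in this summary and may differ.

```python
import numpy as np

def gaussian_eq_curve(n_bins, center, sigma, gain):
    """Per-bin multiplier: 1 + gain * Gaussian bump (keep gain > -1 so it stays positive)."""
    k = np.arange(n_bins)
    return 1.0 + gain * np.exp(-0.5 * ((k - center) / sigma) ** 2)

def equalize(spec, curve):
    """Scale each frequency row of a (freq, time) magnitude spectrogram."""
    return spec * curve[:, None]

def weighted_harmonic_average(hcqt, weights):
    """Collapse (harmonics, freq, time) to (freq, time) using normalized weights."""
    w = np.asarray(weights, dtype=float)
    return np.tensordot(w / w.sum(), hcqt, axes=(0, 0))

# Hypothetical inputs: a flat spectrogram and a three-channel HCQT-like stack.
spec = np.ones((60, 5))
curve = gaussian_eq_curve(n_bins=60, center=30, sigma=8.0, gain=-0.5)
equalized = equalize(spec, curve)  # attenuated most strongly around bin 30

hcqt = np.stack([1.0 * spec, 2.0 * spec, 3.0 * spec])
average = weighted_harmonic_average(hcqt, weights=[1.0, 1.0, 2.0])
```

Because the curve varies smoothly across frequency, it changes the spectral envelope without moving any energy between bins, which is the property a timbre-invariance objective relies on.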