PESTO: Pitch Estimation with Self-supervised Transposition-equivariant Objective

Alain Riou, Stefan Lattner, Gaëtan Hadjeres, Geoffroy Peeters

TL;DR

PESTO tackles monophonic pitch estimation using only unlabeled training data by framing it as a multi-class self-supervised learning (SSL) problem. It trains a lightweight Siamese network on pairs of pitch-shifted inputs $(\mathbf{x}, \mathbf{x}^{(k)})$ and enforces transposition equivariance through a class-based loss $\mathcal{L}_{\text{equiv}}$, complemented by an invariance loss $\mathcal{L}_{\text{inv}}$ and a shifted cross-entropy $\mathcal{L}_{\text{SCE}}$; the final layer is a transposition-preserving Toeplitz layer. The method uses a Constant-Q Transform (CQT) frontend and simulates transpositions by cropping CQT frames, which avoids boundary artifacts. Evaluations on MIR-1K and MDB-stem-synth show that PESTO surpasses self-supervised baselines and approaches the supervised CREPE model while remaining lightweight and robust to background music and domain shifts. This work demonstrates the viability of equivariance-driven SSL for real-time audio tasks and for deployment on low-resource devices, with potential extensions to multi-pitch estimation.
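The cropping idea mentioned in the TL;DR can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `crop_pair`, the `max_shift` margin, and the sign convention (positive `k` moves content toward higher bins) are assumptions made for illustration. Because the CQT has a log-frequency axis, taking two crops of the same frame at offsets that differ by `k` bins yields a pair in which one input is a `k`-bin translation of the other, with no padding or wrap-around at the edges.

```python
import numpy as np

def crop_pair(cqt_frame, k, max_shift):
    """Return (x, x_k): two crops of one CQT frame whose contents differ by k bins.

    Hypothetical helper: `max_shift` bins are reserved on each side of the
    frame so that every offset |k| <= max_shift stays inside the frame.
    """
    if abs(k) > max_shift:
        raise ValueError("|k| must not exceed max_shift")
    n_out = cqt_frame.shape[-1] - 2 * max_shift
    # Reference crop, centered in the frame.
    x = cqt_frame[..., max_shift : max_shift + n_out]
    # Crop starting k bins lower, so content appears k bins higher in x_k.
    x_k = cqt_frame[..., max_shift - k : max_shift - k + n_out]
    return x, x_k

# Toy frame with a single active bin standing in for a spectral peak.
frame = np.zeros(20)
frame[10] = 1.0
x, x_k = crop_pair(frame, k=3, max_shift=4)
shift = int(np.argmax(x_k) - np.argmax(x))
```

In this toy example both crops have the same length, and the peak position differs by exactly `k` bins between them.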

Abstract

In this paper, we address the problem of pitch estimation using Self-Supervised Learning (SSL). The SSL paradigm we use is equivariance to pitch transposition, which enables our model to accurately perform pitch estimation on monophonic audio after being trained only on a small unlabeled dataset. We use a lightweight ($<30$k parameters) Siamese neural network that takes as inputs two different pitch-shifted versions of the same audio represented by its Constant-Q Transform. To prevent the model from collapsing in an encoder-only setting, we propose a novel class-based transposition-equivariant objective which captures pitch information. Furthermore, we design the architecture of our network to be transposition-preserving by introducing learnable Toeplitz matrices. We evaluate our model for the two tasks of singing voice and musical instrument pitch estimation and show that our model is able to generalize across tasks and datasets while being lightweight, hence remaining compatible with low-resource devices and suitable for real-time applications. In particular, our results surpass self-supervised baselines and narrow the performance gap between self-supervised and supervised methods for pitch estimation.
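To see why a Toeplitz matrix is transposition-preserving, consider the following NumPy sketch. It is only an illustration under stated assumptions: the helper `toeplitz_matrix`, the layer sizes, and the random weights are hypothetical and do not reproduce the paper's network. A Toeplitz matrix has constant diagonals, so it is fully determined by a single weight vector, and translating the input by `k` bins translates the output by `k` bins as long as the signal stays away from the boundaries.

```python
import numpy as np

def toeplitz_matrix(w, n_out, n_in):
    """Build an (n_out, n_in) Toeplitz matrix T with T[i, j] = w[i - j + n_in - 1].

    Hypothetical helper: `w` plays the role of the learnable parameters and
    must have n_in + n_out - 1 entries (one per diagonal).
    """
    assert w.shape[0] == n_in + n_out - 1
    i = np.arange(n_out)[:, None]
    j = np.arange(n_in)[None, :]
    return w[i - j + n_in - 1]

rng = np.random.default_rng(0)
n_in, n_out, k = 16, 12, 2
w = rng.standard_normal(n_in + n_out - 1)
T = toeplitz_matrix(w, n_out, n_in)

# Input with support in the middle, so a shift by k does not hit the edges.
x = np.zeros(n_in)
x[5:9] = rng.standard_normal(4)
x_shift = np.roll(x, k)  # only zeros wrap around, so this is a pure translation

y = T @ x
y_shift = T @ x_shift

# Shifting the input by k bins shifts the output by k bins.
equivariant = np.allclose(y_shift[k:], y[:-k])
```

The same check would fail for a generic dense matrix, which is the motivation for constraining the final layer in this way.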

Paper Structure (21 sections, 12 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Example of $k$-transpositions. Visually, $\mathbf{y}$ and $\mathbf{y}'$ are just translated versions of each other. The sign of $k$ and its absolute value respectively indicate the direction and the distance of the translation.
  • Figure 2: Overview of the PESTO method. The input CQT frame (log-frequencies) is first cropped to produce a pair of pitch-shifted inputs $(\mathbf{x}, \mathbf{x}^{(k)})$. Then we compute $\tilde{\mathbf{x}}$ and $\tilde{\mathbf{x}}^{(k)}$ by randomly applying pitch-preserving transforms to the pair. We finally pass $\mathbf{x}$, $\tilde{\mathbf{x}}$ and $\tilde{\mathbf{x}}^{(k)}$ through the network $f_{\theta}$ and optimize the loss between the predicted probability distributions.
  • Figure 3: Architecture of our network $f_{\theta}$. The number of channels varies between the intermediate layers; however, the frequency resolution remains unchanged until the final Toeplitz fully-connected layer.
  • Figure :

Theorems & Definitions (1)

  • Definition