PESTO: Real-Time Pitch Estimation with Self-supervised Transposition-equivariant Objective

Alain Riou, Bernardo Torres, Ben Hayes, Stefan Lattner, Gaëtan Hadjeres, Gaël Richard, Geoffroy Peeters

TL;DR

PESTO is a self-supervised, real-time pitch estimator that operates on a Variable-Q Transform (VQT) frontend and enforces transposition equivariance through a Siamese training objective and a Toeplitz final layer. It achieves state-of-the-art performance among self-supervised methods and is competitive with supervised methods on music and speech benchmarks, while using far fewer parameters and keeping latency low enough for real-time use. Streaming is supported through cached convolutions and a buffer-refilling strategy, which makes practical deployment feasible. Extensive ablations show the benefits of the VQT, the equivariant losses, and training with background-music augmentation for robustness. Code and models are released to facilitate adoption and extension.

Abstract

In this paper, we introduce PESTO, a self-supervised learning approach for single-pitch estimation using a Siamese architecture. Our model processes individual frames of a Variable-$Q$ Transform (VQT) and predicts pitch distributions. The neural network is designed to be equivariant to translations, notably thanks to a Toeplitz fully-connected layer. In addition, we construct pitch-shifted pairs by translating and cropping the VQT frames and train our model with a novel class-based transposition-equivariant objective, eliminating the need for annotated data. Thanks to this architecture and training objective, our model achieves remarkable performance while being very lightweight ($130$k parameters). Evaluations on music and speech datasets (MIR-1K, MDB-stem-synth, and PTDB) demonstrate that PESTO not only outperforms self-supervised baselines but also competes with supervised methods, exhibiting superior cross-dataset generalization. Finally, we enhance PESTO's practical utility by developing a streamable VQT implementation using cached convolutions. Combined with our model's low latency (less than 10 ms) and minimal parameter count, this makes PESTO particularly suitable for real-time applications.
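To illustrate why a Toeplitz fully-connected layer is equivariant to translations, here is a minimal pure-Python sketch. It is not the authors' implementation: the helper names (`toeplitz_matrix`, `matvec`, `shift`) and the toy sizes are assumptions chosen for illustration. A Toeplitz weight matrix has entries that depend only on the difference between the output and input indices, so translating the input by $k$ bins translates the output by $k$ bins, up to boundary effects.

```python
import math

def toeplitz_matrix(kernel, n_in, n_out):
    """Build an (n_out x n_in) matrix whose entry (i, j) depends only on i - j."""
    # One weight per possible value of i - j, which ranges over
    # -(n_in - 1) .. (n_out - 1), hence n_in + n_out - 1 weights in total.
    assert len(kernel) == n_in + n_out - 1
    return [[kernel[i - j + n_in - 1] for j in range(n_in)] for i in range(n_out)]

def matvec(matrix, x):
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]

def shift(x, k):
    """Translate x by k >= 0 bins towards higher indices, zero-padding the start."""
    return [0.0] * k + x[: len(x) - k]

# Toy example (sizes and weights are arbitrary assumptions).
n_in = n_out = 8
kernel = [float(v) for v in range(1, n_in + n_out)]  # 15 arbitrary weights
W = toeplitz_matrix(kernel, n_in, n_out)

# Input whose energy stays away from the upper edge, so nothing is lost by shifting.
x = [0.0, 1.0, 2.0, 0.5, 0.0, 0.0, 0.0, 0.0]
k = 2

y = matvec(W, x)
y_from_shifted_input = matvec(W, shift(x, k))

# Away from the lower boundary, shifting the input by k shifts the output by k.
equivariant = all(
    math.isclose(a, b) for a, b in zip(y_from_shifted_input[k:], y[: n_out - k])
)
print(equivariant)
```

The comparison skips the first `k` output bins, where boundary effects break exact equivariance. The sketch also omits any normalization that turns the layer output into a pitch distribution.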

Paper Structure

This paper contains 43 sections, 27 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: VQT analysis window size as a function of $\gamma$ and analysis center frequency. During the computation of the VQT, all kernels are padded to the power of $2$ nearest to the size of the largest analysis window. The CQT corresponds to the case $\gamma=0$. Both axes are log-scaled.
  • Figure 2: Illustration of the pitch-shift process in the VQT domain. From a given frame, we construct two views by cropping equally-sized sub-frames from it, with a shift of $k$ between them. Since the frequency scale is logarithmic in the VQT domain, this translation corresponds to an approximate pitch shift of $k$ bins.
  • Figure 3: Overview of the PESTO model. Given a 1D VQT frame (displayed horizontally, where the horizontal axis corresponds to frequency), we first crop it as described in Section \ref{sec:vqt_shift} to create a pair of pitch-shifted views $(\mathbf{x}, \mathbf{x}^{(k)})$. We then obtain $\widetilde{\mathbf{x}}$ and $\widetilde{\mathbf{x}}^{(k)}$ by randomly applying pitch-preserving transforms to the views. The neural network $f_{\theta}$ predicts pitch distributions from the different views and is trained by minimizing both an invariance loss between $\mathbf{y}$ and $\widetilde{\mathbf{y}}$ and an equivariance loss between $\widetilde{\mathbf{y}}$ and $\widetilde{\mathbf{y}}^{(k)}$.
  • Figure 4: Illustration of the latency of our model, and how to mitigate it with buffer refilling. When a new buffer is consumed, the returned prediction is the pitch of the center of the VQT frame. Therefore, there is a delay of $w/2$ between when a buffer of audio is obtained and its actual pitch is estimated. Buffer refilling places the most recent buffer at the center of the processed VQT frame, thus improving the reactivity of the model.
  • Figure 5: Comparison of pitch accuracy metrics across different datasets as a function of the VQT parameter $\gamma$. Each subplot shows test performance on a specific dataset (MDB, MIR-1K, or PTDB), with line colors and markers indicating the training dataset. Solid lines represent RPA, while dashed lines represent RCA. The points indicate the mean of the top 3 scores out of 5 runs with different random seeds.
  • ...and 1 more figure
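The cropping scheme described in the Figure 2 caption can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: the function names (`crop_pair`, `sample_pair`), the sign convention (a positive $k$ moves content towards higher bins in the second view), and the toy frame length are all hypothetical choices.

```python
import random

def crop_pair(frame, start, crop_size, k):
    """Return two equally sized crops of `frame`, offset by k bins.

    Assumed convention: x_k[i] == x[i - k] wherever both are defined, so
    content at bin j of x appears at bin j + k of x_k. On a log-frequency
    axis this approximates a pitch shift of k bins.
    """
    assert 0 <= start and start + crop_size <= len(frame)
    assert 0 <= start - k and start - k + crop_size <= len(frame)
    x = frame[start : start + crop_size]
    x_k = frame[start - k : start - k + crop_size]
    return x, x_k

def sample_pair(frame, crop_size, max_shift, rng):
    """Draw a random shift k and a valid crop position, then crop both views."""
    k = rng.randint(-max_shift, max_shift)
    lowest_start = max(0, k)
    highest_start = len(frame) - crop_size + min(0, k)
    start = rng.randint(lowest_start, highest_start)
    x, x_k = crop_pair(frame, start, crop_size, k)
    return x, x_k, k

# Toy frame with a single spectral peak at bin 10 (values are arbitrary).
frame = [0.0] * 20
frame[10] = 1.0

x, x_k = crop_pair(frame, start=4, crop_size=12, k=3)
peak_x = x.index(max(x))
peak_x_k = x_k.index(max(x_k))
print(peak_x_k - peak_x)  # displacement of the peak between the two views

# Random sampling, as one might do at training time.
x_r, x_k_r, k_r = sample_pair(frame, crop_size=12, max_shift=3, rng=random.Random(0))
```

Because both views come from the same frame, no audio-domain pitch shifting is needed; the price is that the crops are shorter than the full frame by at least the largest shift.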