Unsupervised Harmonic Parameter Estimation Using Differentiable DSP and Spectral Optimal Transport

Bernardo Torres, Geoffroy Peeters, Gaël Richard

TL;DR

This work tackles unsupervised harmonic parameter estimation by introducing Spectral Optimal Transport (SOT) as a "horizontal" spectral loss, one that measures how far spectral energy must be displaced along the frequency axis, for end-to-end audio modeling. It builds an unsupervised autoencoder that jointly estimates the fundamental frequency $f_0$ and the harmonic amplitudes from a constant-Q transform (CQT) input and reconstructs the signal with a differentiable harmonic synthesizer. On synthetic data, compared with the traditional Multi-Scale Spectral (MSS) loss, SOT is more robust and localizes frequency better thanks to its Wasserstein-based spectral comparison. It remains sensitive to small spectral variations, however, and may need to be combined with "vertical" (bin-wise) losses for real-world data. Overall, SOT is a promising direction for improving unsupervised parameter estimation in neural audio applications and for reducing the need for external pitch trackers.

Abstract

In neural audio signal processing, pitch conditioning has been used to enhance the performance of synthesizers. However, jointly training pitch estimators and synthesizers is a challenge when using standard audio-to-audio reconstruction loss, leading to reliance on external pitch trackers. To address this issue, we propose using a spectral loss function inspired by optimal transportation theory that minimizes the displacement of spectral energy. We validate this approach through an unsupervised autoencoding task that fits a harmonic template to harmonic signals. We jointly estimate the fundamental frequency and amplitudes of harmonics using a lightweight encoder and reconstruct the signals using a differentiable harmonic synthesizer. The proposed approach offers a promising direction for improving unsupervised parameter estimation in neural audio applications.

Paper Structure (14 sections, 8 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: Cumulative sum and quantile functions of two measures $\alpha$ and $\beta$ representing harmonic spectra with different $f_0$s across the Nyquist range $[0, 0.5]$. The grey lines depict the pointwise differences between the inverse CDFs as per Eq. \ref{eq:wasserstein_discrete}, with the shaded region representing the Wasserstein metric $\mathcal{W}_1$. Green connectors denote the optimal plan ($\mathbf{P}^*_{ij}$), weighted by their respective magnitudes.
  • Figure 2: Normalized spectral Single-Scale (SS), Multi-Scale (MSS) and proposed SOT $\mathcal{W}_2$ losses as a function of the frequency shift between two sinusoidal signals of $4096$ samples each, sampled at $16$ kHz. The reference sinusoid has a frequency of $4000$ Hz. SS is computed for window size $\gamma=1024$ and MSS with scales $\Gamma=\{2^k\}_{k=6,\dots ,11}$.
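
The quantile-function view in Figure 1 can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `quantile_wasserstein` and `harmonic_comb`, the 4-harmonic comb with $1/k$ amplitudes, and the 1001-point frequency grid are all assumptions chosen for illustration. The sketch normalizes two magnitude spectra into probability measures, builds their cumulative sums, and integrates the pointwise difference between the two inverse CDFs over the quantile levels, which is the standard closed form for the Wasserstein distance in one dimension.

```python
import numpy as np


def quantile_wasserstein(freqs, a, b, p=2):
    """W_p between two nonnegative spectra a and b on a sorted frequency grid.

    Illustrative sketch: both spectra are normalized to unit mass, then the
    1-D closed form (integral of |F_a^{-1} - F_b^{-1}|^p over [0, 1]) is used.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.sum() <= 0 or b.sum() <= 0:
        raise ValueError("both spectra need strictly positive total energy")

    # Cumulative sums (CDFs) of the normalized spectra, kept inside [0, 1].
    cdf_a = np.minimum(np.cumsum(a / a.sum()), 1.0)
    cdf_b = np.minimum(np.cumsum(b / b.sum()), 1.0)
    cdf_a[-1] = 1.0
    cdf_b[-1] = 1.0

    # Quantile levels at which either CDF jumps; the inverse CDFs are
    # piecewise constant between consecutive levels.
    levels = np.unique(np.concatenate(([0.0], cdf_a, cdf_b)))
    widths = np.diff(levels)
    mids = levels[:-1] + widths / 2.0

    # Inverse CDF: smallest frequency whose cumulative mass reaches the level.
    q_a = freqs[np.searchsorted(cdf_a, mids, side="left")]
    q_b = freqs[np.searchsorted(cdf_b, mids, side="left")]

    return float(np.sum(widths * np.abs(q_a - q_b) ** p) ** (1.0 / p))


def harmonic_comb(freqs, f0, n_harmonics=4):
    """Hypothetical toy spectrum: peaks at k * f0 with amplitude 1 / k."""
    spec = np.zeros_like(freqs)
    step = freqs[1] - freqs[0]
    for k in range(1, n_harmonics + 1):
        idx = int(np.rint(k * f0 / step))
        if idx < len(freqs):
            spec[idx] += 1.0 / k
    return spec


if __name__ == "__main__":
    # Normalized frequency axis covering the Nyquist range [0, 0.5].
    freqs = np.linspace(0.0, 0.5, 1001)
    reference = harmonic_comb(freqs, 0.05)
    for f0 in (0.05, 0.06, 0.08):
        shifted = harmonic_comb(freqs, f0)
        print(f0, quantile_wasserstein(freqs, reference, shifted, p=1))
```

In this toy setting the distance is zero for identical spectra and grows with the gap between the two fundamental frequencies, which is the behaviour a horizontal loss needs in order to give a useful gradient direction for $f_0$. A training loss would additionally have to be written in a differentiable framework; the NumPy version above only illustrates the computation.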