Noise-Robust DSP-Assisted Neural Pitch Estimation with Very Low Complexity

Krishna Subramani, Jean-Marc Valin, Jan Buethe, Paris Smaragdis, Mike Goodwin

TL;DR

The paper tackles the challenge of real-time, robust pitch estimation by proposing a hybrid approach that couples a compact DNN with DSP-derived features (cross-correlation of the LPC residual and instantaneous-frequency measurements). This design achieves high accuracy comparable to end-to-end DNN methods while maintaining DSP-like latency (10 ms) and dramatically lower complexity than models like CREPE. Key contributions include three network architectures (Xcorr, IF, and Joint), a training regime that leverages CREPE-derived ground truth, and extensive evaluation showing superior noise robustness and improved neural vocoding performance. The work demonstrates that instantaneous frequency features are particularly impactful for pitch estimation and that a carefully designed hybrid system can substantially reduce computational load without sacrificing performance, enabling practical deployment in real-time speech processing systems.

Abstract

Pitch estimation is an essential step of many speech processing algorithms, including speech coding, synthesis, and enhancement. Recently, pitch estimators based on deep neural networks (DNNs) have been outperforming well-established DSP-based techniques. Unfortunately, these new estimators can be impractical to deploy in real-time systems, both because of their relatively high complexity and because some require significant lookahead. We show that a hybrid estimator using a small DNN with traditional DSP-based features can match or exceed the performance of pure DNN-based models, with a complexity and algorithmic delay comparable to traditional DSP-based algorithms. We further demonstrate that this hybrid approach can provide benefits for a neural vocoding task.

Paper Structure (8 sections, 5 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Network architectures for pitch estimation from a) cross-correlation (Xcorr), b) instantaneous frequency (IF), and c) joint (IF + Xcorr) features. All networks output a distribution over the possible pitch values; the pitch estimate $\textrm{p}^{*}$ for a frame is the $\textrm{argmax}$ of the network output.
  • Figure 2: Histogram of PTDB reference pitch for female and male speakers. The bump in the female histogram below 125 Hz is due to period-doubling errors.
  • Figure 3: RCA for different SNR values on PTDB. All of our proposed models are significantly more robust to noise than CREPE, and also outperform the purely DSP-based LPE.
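
The decoding step in the Figure 1 caption (the per-frame pitch estimate is the argmax of the network's output distribution) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `decode_pitch`, the log-spaced bin layout, and the values of `f_min` and `bins_per_octave` are assumptions made for the example and are not taken from the paper.

```python
import numpy as np


def decode_pitch(posterior, f_min=62.5, bins_per_octave=60):
    """Map a per-frame distribution over pitch bins to a pitch in Hz.

    posterior: array of shape (frames, bins), one distribution per frame.
    f_min, bins_per_octave: ASSUMED bin layout (log-spaced bins starting
    at f_min); the paper's actual quantization may differ.
    """
    posterior = np.asarray(posterior, dtype=float)
    # The pitch estimate p* for each frame is the index of the most likely bin.
    best_bin = np.argmax(posterior, axis=-1)
    # Convert the bin index to Hz under the assumed log-frequency layout.
    return f_min * 2.0 ** (best_bin / bins_per_octave)


# Toy input: two frames, 180 bins, with peaks at bin 0 and bin 60.
toy = np.full((2, 180), 1e-3)
toy[0, 0] = 1.0
toy[1, 60] = 1.0
pitch_hz = decode_pitch(toy)
```

Under the assumed layout, bin 60 lies exactly one octave above bin 0, so the two toy frames decode to `f_min` and `2 * f_min`. Taking a hard argmax keeps the decoding stateless and cheap per frame; any smoothing across frames would be a separate design choice not shown here.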