Robust Pitch Estimation and Tracking for Speakers Based on Subband Encoding and the Generalized Labeled Multi-Bernoulli Filter

Shoufeng Lin

Abstract

This paper proposes a new pitch estimator and a novel pitch tracker for speakers. We first decompose the sound signal into subbands using an auditory filterbank, assuming time-frequency sparsity of human speech. Instead of choosing the number of subbands empirically, we propose a novel frequency coverage metric to derive the number of subbands and the center frequencies of the filterbank. The subband signals are then encoded using a scheme inspired by the computational auditory scene analysis (CASA) approach, and their normalized autocorrelations are calculated for pitch estimation. To suppress spurious errors and track the speaker identity, we exploit the temporal continuity constraint and adapt a Generalized Labeled Multi-Bernoulli (GLMB) filter for pitch tracking, using a novel pitch state transition model based on the Ornstein-Uhlenbeck process and a measurement-driven birth model for adaptive births of new pitch targets. Experimental evaluations with various additive noises demonstrate that the proposed methods achieve better accuracy than several state-of-the-art pitch estimation methods in most of the studied scenarios. Tests using real recordings in a reverberant room also show that the proposed method is robust against reverberation.
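The normalized-autocorrelation step mentioned in the abstract can be illustrated with a minimal sketch. It operates on a single raw frame and does not reproduce the paper's filterbank, subband encoding, or GLMB tracker. The function names, the pitch search range (`f_min`, `f_max`), and the tolerance rule that prefers the shortest near-maximal lag are assumptions made for illustration, not the author's method.

```python
import numpy as np


def normalized_autocorrelation(frame, max_lag):
    """Return r[0..max_lag], each lag normalized by the energies of both segments."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame) - max_lag  # common segment length for every lag
    if n <= 0:
        raise ValueError("frame must be longer than max_lag")
    ref = frame[:n]
    ref_energy = np.dot(ref, ref)
    r = np.zeros(max_lag + 1)
    for lag in range(max_lag + 1):
        seg = frame[lag:lag + n]
        denom = np.sqrt(ref_energy * np.dot(seg, seg))
        r[lag] = np.dot(ref, seg) / denom if denom > 0 else 0.0
    return r


def estimate_pitch(frame, fs, f_min=60.0, f_max=400.0, tol=1e-3):
    """Pick the shortest lag whose normalized autocorrelation is within tol of the peak.

    Returns (pitch in Hz, autocorrelation value at the chosen lag).
    The tolerance rule is an illustrative guard against picking a multiple
    of the true period; it is an assumption, not taken from the paper.
    """
    lag_min = int(np.floor(fs / f_max))
    lag_max = int(np.ceil(fs / f_min))
    r = normalized_autocorrelation(frame, lag_max)
    search = r[lag_min:lag_max + 1]
    candidates = np.flatnonzero(search >= search.max() - tol)
    best_lag = lag_min + int(candidates[0])
    return fs / best_lag, r[best_lag]
```

For a frame sampled at `fs` Hz, `estimate_pitch(frame, fs)` returns a pitch candidate and its autocorrelation strength; the strength could serve as a voicing confidence before any tracking stage is applied.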

Paper Structure

This paper contains 26 sections, 57 equations, 11 figures, 3 tables, 1 algorithm.

Figures (11)

  • Figure 1: An example of the frequency coverage metric (using the Gammatone filters). The frequency range is 60 Hz to 1270 Hz; for $\eta_c=1$ this gives 18 subbands, and the $-3$ dB passbands of the resulting subband filters align.
  • Figure 2: Pitch encoding template (top panel); a subband signal from the filterbank, $x^{(b)}$, its half-wave rectified version $\hat{x}^{(b)}$, and the encoded signal $x_e^{(b)}$ (middle panel); and the normalized autocorrelation coefficients of the respective signals (bottom panel).
  • Figure 3: Pitch estimation results (female speech with babble noise, SNR = 5 dB). The left column gives the pitch estimation results from the proposed method. The right column shows the pitch estimation results using the autocorrelation of raw subband signals.
  • Figure 4: From top to bottom: waveform of the speech signal with babble noise at an SNR of 20 dB, pitch ground truth of the clean speech signal, and pitch estimation results from the RAPT, STRAIGHT, YIN, PEFAC, SHRP, and proposed methods, respectively.
  • Figure 5: From top to bottom: waveform of the speech signal with babble noise at an SNR of 5 dB, pitch ground truth of the clean speech signal, and pitch estimation results from the RAPT, STRAIGHT, YIN, PEFAC, SHRP, and proposed methods, respectively.
  • ...and 6 more figures
