Comparison of fundamental frequency estimators with subharmonic voice signals

Takeshi Ikuma, Melda Kunduk, Andrew J. McWhorter

TL;DR

Subharmonic voicing can bias estimates of the speaking fundamental frequency $f_o$, which underpins many clinical acoustic metrics. This study compares five estimators (Praat, Harvest, YAAPT, CREPE, and FCN-F0) on sustained vowels from the KayPENTAX Disordered Voice Database. Estimates are scored against ground-truth annotations of $f_o$ with a quality-of-estimate framework, and the subharmonics-to-harmonics ratio (SHR) quantifies the strength of subharmonic voicing. FCN-F0 and CREPE achieve the highest per-frame accuracy (≈96% and ≈95%, respectively), while the ACF estimator performs worst (≈62%); the deep-learning models also handle subharmonic errors better across SHR ranges. These findings support deep-learning-based $f_o$ estimation in clinical contexts and suggest that retraining with subharmonic data could further improve performance, particularly for high-SHR cases.

Abstract

In clinical voice signal analysis, mishandling of subharmonic voicing may cause an acoustic parameter to signal false negatives. As such, the ability of a fundamental frequency estimator to identify speaking fundamental frequency is critical. This paper presents a sustained-vowel study, which used a quality-of-estimate classification to identify subharmonic errors and subharmonics-to-harmonics ratio (SHR) to measure the strength of subharmonic voicing. Five estimators were studied with a sustained vowel dataset: Praat, YAAPT, Harvest, CREPE, and FCN-F0. FCN-F0, a deep-learning model, performed the best both in overall accuracy and in correctly resolving subharmonic signals. CREPE and Harvest are also highly capable estimators for sustained vowel analysis.
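
To make the failure mode concrete, the following toy sketch shows how a subharmonic component can pull a simple correlation-based estimator down to half the speaking $f_o$. Everything here is an illustrative assumption rather than the paper's method or any of the studied estimators: the two-sinusoid signal, the function names `synth_subharmonic` and `naive_acf_pitch`, and all parameter values are hypothetical.

```python
import math

def synth_subharmonic(f0=200.0, fs=8000, n=1600, sub_amp=0.5):
    """Toy signal: a sine at f0 plus a weaker sine at f0/2 (period doubling)."""
    return [
        math.sin(2 * math.pi * f0 * i / fs)
        + sub_amp * math.sin(math.pi * f0 * i / fs)
        for i in range(n)
    ]

def naive_acf_pitch(x, fs, fmin=60.0, fmax=500.0, win=800):
    """Return fs / lag for the lag with the highest normalized correlation.

    Hypothetical, deliberately naive estimator: no voicing decision,
    no octave-cost or smoothing as real estimators typically apply.
    """
    best_lag, best_r = None, -2.0
    for lag in range(int(fs / fmax), int(fs / fmin) + 1):
        a, b = x[:win], x[lag:lag + win]
        num = sum(p * q for p, q in zip(a, b))
        den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
        r = num / den
        if r > best_r:
            best_lag, best_r = lag, r
    return fs / best_lag

fs = 8000
f_speaking = 200.0  # the frequency a listener would call the pitch
estimate = naive_acf_pitch(synth_subharmonic(f0=f_speaking, fs=fs), fs)
print(estimate, f_speaking / estimate)
```

Because the toy signal only repeats exactly every two cycles of the 200 Hz component, the correlation peaks at the doubled period, so the estimate lands near 100 Hz, a period elongation factor of about 2. This is the kind of subharmonic error the study sets out to quantify.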

Paper Structure (4 sections, 3 equations, 6 figures)


Figures (6)

  • Figure 1: Histogram of the annotated fundamental frequencies $f_o^*$ (15941 samples).
  • Figure 2: (color online) Illustration of the harmonic power profile $P(f_o)$ and the accuracy classification of $\hat{f}_o$. The vertical lines indicate the $\hat{f}_o$ of the labeled estimator, and the horizontal bars along the top edge indicate the intervals associated with the truth ($M=1$) and subharmonic errors ($M>1$) ($\mathrm{SHR} = -6.8$ dB).
  • Figure 3: (color online) (left column) Scatter plots of estimated $\hat{f}_o$ ($x$-axis) vs. manually selected truth $f_o^*$ ($y$-axis) of the six $f_o$ estimators under study. Diagonal grid lines indicate the mapping of correct and subharmonic estimates ($M = 1$ to $7$). Vertical dotted lines indicate the minimum $f_o$ imposed by each estimator. The samples at $\hat{f}_o=0$ (U) indicate the intervals that were marked as unvoiced by the $f_o$ estimator. (right column) Histograms of the period elongation factor of the estimates, $\hat{M} = f_o^*/\hat{f}_o$ (15941 samples per plot).
  • Figure 4: (color online) Quality-of-estimate outcomes: Correct $f_o$ estimation rates, subharmonics error rates, and other error rates of the $f_o$ estimators under study. Numbers in parentheses are the number of intervals out of 15941 intervals.
  • Figure 5: (color online) Contingency tables of the outcomes of the ACF estimator vs. the other five $f_o$ estimators. Numbers indicate the number of intervals out of 15941 intervals.
  • ...and 1 more figure
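
The period elongation factor from the Figure 3 caption, $\hat{M} = f_o^*/\hat{f}_o$, suggests a simple way to sort each estimate into the outcome categories named in Figure 4. The sketch below is a rough approximation under stated assumptions, not the paper's procedure: the paper classifies estimates using intervals on the harmonic power profile (Figure 2), whereas this code uses a hypothetical relative tolerance `tol`. The function name `classify_estimate`, the 5% tolerance, and the separate "unvoiced" label are all assumptions made for illustration.

```python
from collections import Counter

def classify_estimate(f_true, f_est, max_m=7, tol=0.05):
    """Label one estimate as 'unvoiced', 'correct', 'subharmonic', or 'other'.

    f_true : annotated fundamental frequency f_o^* in Hz
    f_est  : estimated fundamental frequency in Hz (0 means unvoiced)
    max_m  : largest elongation factor considered (Figure 3 shows M = 1 to 7)
    tol    : hypothetical relative tolerance; NOT the paper's criterion
    """
    if f_est <= 0:
        return "unvoiced", None
    m_hat = f_true / f_est          # period elongation factor
    m = round(m_hat)                # nearest integer candidate M
    if 1 <= m <= max_m and abs(f_est * m - f_true) / f_true <= tol:
        return ("correct" if m == 1 else "subharmonic"), m
    return "other", None

# Made-up (truth, estimate) pairs in Hz, purely for demonstration.
pairs = [(200.0, 201.0), (200.0, 100.5), (200.0, 66.0),
         (200.0, 0.0), (200.0, 310.0)]
labels = [classify_estimate(t, e)[0] for t, e in pairs]
counts = Counter(labels)
rates = {label: n / len(pairs) for label, n in counts.items()}
print(rates)
```

Tallying the labels over all annotated intervals would give per-estimator rates analogous to those in Figure 4, although the exact numbers depend on the tolerance and on how unvoiced intervals are counted.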