Towards detecting the pathological subharmonic voicing with fully convolutional neural networks

Takeshi Ikuma, Melda Kunduk, Brad Story, Andrew J. McWhorter

TL;DR

The paper tackles the problem of detecting subharmonic phonation, which biases standard $f_o$ estimates, by training fully convolutional networks (FCNs) on synthetically generated subharmonic signals. Two FCN variants, FCN-401 and FCN-785, are trained with Monte Carlo synthesized data that emulate amplitude and frequency modulation of subharmonics across periods $M \in \{1,2,3,4\}$. The approach reaches high classification accuracy on synthetic data ($\approx 98\%$) and shows promising results on real sustained vowels. The findings show the method’s potential to improve subharmonic measures and $f_o$ estimation, while identifying limitations related to biphonation, intermittency, and variability in speaking rate and $f_o$ that guide future work. Overall, this work establishes a data-driven, duration-agnostic pathway for reliable subharmonic detection in clinical voice analysis, with clear directions to broaden applicability to naturalistic speech.

Abstract

Many voice disorders induce subharmonic phonation, but voice signal analysis is currently lacking a technique to detect the presence of subharmonics reliably. Distinguishing subharmonic phonation from normal phonation is a challenging task as both are nearly periodic phenomena. Subharmonic phonation adds cyclical variations to the normal glottal cycles. Hence, the estimation of subharmonic period requires a holistic analysis of the signals. Deep learning is an effective solution to this type of complex problem. This paper describes fully convolutional neural networks which are trained with synthesized subharmonic voice signals to classify the subharmonic periods. Synthetic evaluation shows over 98% classification accuracy, and assessment of sustained vowel recordings demonstrates encouraging outcomes as well as areas for future improvement.
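
To make the notion of a subharmonic period concrete, the following is a minimal toy sketch, not the paper's transmission-line synthesis model. It assumes a five-harmonic tone at $f_o$ with a cosine amplitude modulation at $f_o/M$; the function names `toy_subharmonic` and `repeats_after`, the modulation depth, and the sampling rate are all hypothetical choices made for illustration. It checks after how many glottal cycles the waveform repeats itself.

```python
import numpy as np

def toy_subharmonic(fo=100.0, M=2, depth=0.3, fs=8000, n_cycles=40):
    """Toy voiced signal whose cycle amplitude pattern repeats every M cycles.

    Assumption for illustration only: a 5-harmonic tone at fo with a cosine
    amplitude modulation at fo/M (M=1 means no modulation, i.e. normal voicing).
    """
    n = np.arange(int(round(fs * n_cycles / fo)))
    t = n / fs
    # Harmonic "carrier" with period 1/fo (one glottal cycle).
    tone = sum(np.sin(2 * np.pi * k * fo * t) / k for k in range(1, 6))
    # Cyclical amplitude variation with period M/fo.
    mod = (1.0 + depth * np.cos(2 * np.pi * (fo / M) * t)) if M > 1 else 1.0
    return mod * tone

def repeats_after(x, lag, tol=1e-6):
    """True if x matches itself when shifted by `lag` samples."""
    return bool(np.allclose(x[lag:], x[:-lag], atol=tol))

fs, fo = 8000, 100.0
T = int(fs / fo)  # samples per glottal cycle
for M in (1, 2, 3, 4):
    x = toy_subharmonic(fo=fo, M=M, fs=fs)
    # For each M, test whether the signal repeats after 1, 2, 3, 4 cycles.
    print(M, [repeats_after(x, k * T) for k in (1, 2, 3, 4)])
```

In this toy setting the signal with $M=1$ repeats after every glottal cycle, while a signal with $M>1$ first repeats only after $M$ cycles. Each case is still periodic, which illustrates why the two kinds of phonation are hard to tell apart from local features alone. Real voice signals are only nearly periodic, so an exact-repetition check like this one would not work on recordings.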

Paper Structure (6 sections, 23 equations, 8 figures, 1 table)

This paper contains 6 sections, 23 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Fully convolutional neural network architecture under study: FCN-401 with $X=16$ and $Y=512$, and FCN-785 with $X=64$ and $Y=128$. All convolutional layers are applied with a stride of 1, and there is a batch normalization layer (not pictured) before every ReLU.
  • Figure 2: Block diagram of the transmission-line voice synthesis model.
  • Figure 3: Synthetic classification confusion matrices: (a) FCN-401, (b) FCN-785.
  • Figure 4: Synthetic classification performance vs. SHR ($M>1$ signals only): (a) SHR distribution and (b) classification accuracy vs. SHR.
  • Figure 5: Synthetic classification performance vs. $f_o$. Dashed lines are the least squares fitted lines.
  • ...and 3 more figures