Impact of Speech Mode in Automatic Pathological Speech Detection
Shakeel A. Sheikh, Ina Kodrasi
TL;DR
This work addresses a gap in automatic pathological speech detection: models are largely trained and evaluated on phonetically controlled, non-spontaneous speech. It systematically compares classical machine learning with handcrafted features against deep learning approaches, using both non-spontaneous and spontaneous speech from two datasets, PC-GITA and MoSpeeDi. The study shows that deep learning models, including CNNs, AEs, and wav2vec2-based pipelines, are better able to ignore non-pathology fluctuations in spontaneous speech and often extract additional pathology cues, whereas classical SVM-based methods show limited gains and are highly feature-dependent. The findings suggest that self-supervised embeddings and end-to-end deep architectures are especially beneficial for spontaneous speech diagnostics, with implications for scalable, real-world screening and monitoring of motor speech disorders.
Abstract
Automatic pathological speech detection approaches yield promising results in identifying various pathologies. These approaches are typically designed and evaluated for phonetically controlled speech scenarios, where speakers are prompted to articulate identical phonetic content. While gathering controlled speech recordings can be laborious, spontaneous speech can be conveniently acquired as potential patients navigate their daily routines. Further, spontaneous speech can be valuable in detecting subtle and abstract cues of pathological speech. Nonetheless, the efficacy of automatic pathological speech detection for spontaneous speech remains unexplored. This paper analyzes the influence of speech mode on pathological speech detection approaches, examining two distinct categories of approaches, i.e., classical machine learning and deep learning. Results indicate that classical approaches may struggle to capture pathology-discriminant cues in spontaneous speech. In contrast, deep learning approaches demonstrate superior performance, managing to extract additional cues that were previously inaccessible in non-spontaneous speech.
