A Comprehensive Rubric for Annotating Pathological Speech
Mario Corrales-Astorgano, David Escudero-Mancebo, Lourdes Aguilar, Valle Flores-Lucas, Valentín Cardeñoso-Payo, Carlos Vivaracho-Pascual, César González-Ferreras
TL;DR
This paper introduces a three-dimensional rubric (phonetics, fluency, prosody) for annotating the pathological speech of individuals with Down syndrome (DS), and describes its development, its validation, and a supporting annotation website. The rubric is then applied to the Prautocal corpus in two preliminary experiments: GoP-based phonetic assessment using wav2vec-derived probabilities with Montreal Forced Aligner (MFA) segmentation, and stuttering-disfluency detection using wav2vec 2.0 embeddings with an SVM classifier trained on the KSoF corpus. Results show only moderate correlations between GoP and phonetic judgments, and variable, often imbalanced performance across disfluency types, underscoring the difficulty of automatic pathology labeling and the need for more data and multiple evaluators. The work establishes a framework and dataset for advancing automatic assessment of DS speech and identifies concrete future directions in model tuning and annotation strategy to improve reliability and usefulness in speech therapy.
Abstract
Rubrics are commonly used to label voice corpora for speech quality assessment, but their application to pathological speech remains relatively limited. In this study, we introduce a comprehensive rubric covering three dimensions of speech quality: phonetics, fluency, and prosody. The objective is to establish standardized criteria for identifying errors in the speech of individuals with Down syndrome, thereby enabling the development of automated assessment systems. To this end, we used the Prautocal corpus. To assess the quality of the annotations produced with our rubric, we conducted two experiments, one on phonetics and one on fluency. For phonetic evaluation, we computed the Goodness of Pronunciation (GoP) metric using automatic segmentation systems and correlated the results with the evaluations of a specialized speech therapist. Although the correlation values obtained were not notably high, a positive trend was observed. For fluency assessment, we extracted audio features with deep learning models such as wav2vec and used an SVM classifier, trained on a corpus focused on identifying fluency issues, to categorize Prautocal corpus samples. The outcomes highlight the complexity of evaluating such phenomena, with performance varying according to the specific type of disfluency detected.
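To make the phonetic experiment more concrete, the following minimal sketch shows one common posterior-based formulation of GoP: the mean log posterior of the expected phone over the frames of its segment. It is an illustration under stated assumptions, not necessarily the exact formulation used in this study. The function name `goodness_of_pronunciation`, the `(phone_id, start, end)` segment layout, and the toy values are all hypothetical; in practice the posteriors would come from an acoustic model such as wav2vec and the segment boundaries from a forced aligner.

```python
import numpy as np


def goodness_of_pronunciation(posteriors, segments, eps=1e-10):
    """Return one GoP score per segment.

    posteriors: array of shape (num_frames, num_phones) with per-frame
        phone probabilities (assumed to come from an acoustic model).
    segments: iterable of (phone_id, start_frame, end_frame) tuples, with
        end_frame exclusive (assumed to come from a forced aligner).
    The score is the mean log posterior of the expected phone, so values
    closer to 0 indicate a closer match to the expected pronunciation.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    scores = []
    for phone_id, start, end in segments:
        if end <= start:
            raise ValueError("segment must span at least one frame")
        frame_probs = posteriors[start:end, phone_id]
        # eps guards against log(0) for frames with zero probability.
        scores.append(float(np.mean(np.log(frame_probs + eps))))
    return scores


# Toy example with hypothetical values: 4 frames, 3 phone classes.
toy_posteriors = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.3, 0.5],
])
# (expected phone id, start frame, end frame)
toy_segments = [(0, 0, 2), (1, 2, 4)]
toy_scores = goodness_of_pronunciation(toy_posteriors, toy_segments)
```

In this toy example the first segment scores higher than the second, because the expected phone dominates its frames in the first case but not in the second. Per-segment scores of this kind are what would be correlated with the therapist's phonetic judgments.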
