Automatic Speech Recognition (ASR) for the Diagnosis of pronunciation of Speech Sound Disorders in Korean children
Taekyung Ahn, Yeonjung Hong, Younggon Im, Do Hyung Kim, Dayoung Kang, Joo Won Jeong, Jae Won Kim, Min Jung Kim, Ah-ra Cho, Dae-Hyun Jang, Hosung Nam
TL;DR
This work adapts a large multilingual end-to-end ASR model (wav2vec 2.0 XLS-R) to diagnose pronunciation errors in Korean children with speech sound disorders (SSDs), using a clinically chosen set of 73 words and clinician-provided pronunciations as targets. After fine-tuning on roughly 1.5 hours of labeled data, the model achieves about 90% concordance with human annotations and substantially outperforms a baseline Whisper model on this task. The study also investigates language-model weighting and a pronunciation-error dictionary, finding limited or mixed benefits for single-word pronunciation recognition. Overall, the approach demonstrates the feasibility of automating SSD pronunciation assessment in a low-resource, clinically relevant setting, with the potential to streamline diagnostic workflows in Korean pediatric clinics.
Abstract
This study presents an automatic speech recognition (ASR) model designed to diagnose pronunciation issues in children with speech sound disorders (SSDs), with the aim of replacing manual transcription in clinical procedures. Because general-purpose ASR models are trained to map input speech onto real words, they tend to normalize mispronunciations into valid vocabulary items, which makes even well-known, high-performance models impractical for evaluating pronunciation in children with SSDs. We therefore fine-tuned the wav2vec 2.0 XLS-R model to recognize speech as it is pronounced rather than as existing words. The model was fine-tuned on a speech dataset from 137 children with inadequate speech production, each pronouncing 73 Korean words selected for actual clinical diagnosis. The model's predictions of the word pronunciations matched the human annotations with about 90% accuracy. While the model still requires improvement in recognizing unclear pronunciation, this study demonstrates that ASR models can streamline complex diagnostic procedures for pronunciation errors in clinical settings.
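To make the reported agreement figure concrete, the following is a minimal sketch of how concordance between model output and human annotation could be computed. It assumes an exact-match criterion per word, which the abstract does not specify; the function name `exact_match_accuracy` and the sample transcriptions are hypothetical and are not taken from the study's data.

```python
import unicodedata


def _normalize(text):
    # Hangul can be stored composed (NFC) or decomposed (NFD);
    # normalizing avoids spurious mismatches between identical syllables.
    return unicodedata.normalize("NFC", text.strip())


def exact_match_accuracy(predictions, references):
    """Fraction of items whose predicted pronunciation equals the annotation.

    Assumption: agreement is scored as an exact string match per word.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one item is required")
    matches = sum(
        _normalize(p) == _normalize(r) for p, r in zip(predictions, references)
    )
    return matches / len(references)


# Hypothetical as-pronounced transcriptions (illustrative only).
human = ["사탕", "타탕", "고기", "호랑이", "오랑이"]
model = ["사탕", "타탕", "고기", "호랑이", "호랑이"]

print(exact_match_accuracy(model, human))
```

In this toy example the last item shows the failure mode the abstract describes: the annotator heard a mispronunciation, while the model output the real word. A stricter or more lenient score (for example, a character-level error rate) would be an equally plausible choice under different assumptions.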
