Fine-Tuning Automatic Speech Recognition for People with Parkinson's: An Effective Strategy for Enhancing Speech Technology Accessibility
Xiuwen Zheng, Bornali Phukon, Mark Hasegawa-Johnson
TL;DR
The paper addresses ASR for Parkinson's disease-related dysarthria and dysphonia by fine-tuning a pretrained wav2vec 2.0 model on the SAP-1005 dataset, demonstrating substantial relative improvements over LibriSpeech baselines. It systematically evaluates speaker clustering, severity-dependent modeling, weighted training, and multi-task learning with an auxiliary severity output, identifying conditions under which each approach helps. The best overall word error rate (WER) is about 26.53%, attained via cluster-weighted training or multi-task learning with a first-token severity classifier, with severity-based strategies offering the most consistent gains for severe impairment. These results highlight practical, impairment-informed strategies to enhance the accessibility of speech technology for people with Parkinson's disease, supporting more inclusive ASR applications.
Abstract
This paper enhances dysarthric and dysphonic speech recognition by fine-tuning pretrained automatic speech recognition (ASR) models on the 2023-10-05 data package of the Speech Accessibility Project (SAP), which contains the speech of 253 people with Parkinson's disease. Experiments tested methods that have been effective for Cerebral Palsy, including speaker clustering, severity-dependent models, weighted fine-tuning, and multi-task learning. The best results were obtained with a multi-task learning model, in which the ASR model is trained to produce an estimate of the speaker's impairment severity as an auxiliary output. The resulting word error rates are considerably better than those of a baseline model fine-tuned using only LibriSpeech data, with relative word error rate improvements of 37.62% and 26.97% compared to fine-tuning on 100h and 960h of LibriSpeech data, respectively.
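As an aid to reading the figures above, the following is a minimal sketch of how a word error rate and a relative improvement between two systems are commonly computed. It assumes the reported improvements are relative WER reductions, (baseline - new) / baseline. The function names and the example numbers are hypothetical illustrations, not values or code from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical transcripts and WER values, for illustration only.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER: {wer:.3f}")
    print(f"Relative reduction: {relative_wer_reduction(40.0, 30.0):.2%}")
```

Under this definition, a relative improvement is the share of the baseline's errors that the new system removes, so it is not the same as the absolute difference in percentage points between the two WERs.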
