Fine-Tuning Automatic Speech Recognition for People with Parkinson's: An Effective Strategy for Enhancing Speech Technology Accessibility

Xiuwen Zheng, Bornali Phukon, Mark Hasegawa-Johnson

TL;DR

The paper addresses ASR for Parkinson's disease-related dysarthria/dysphonia by fine-tuning a pretrained wav2vec 2.0 model on the SAP-1005 dataset, demonstrating substantial relative improvements over LibriSpeech baselines. It systematically evaluates speaker clustering, severity-dependent modeling, weighted training, and multi-task learning with an auxiliary severity output, identifying conditions under which each approach helps. The best overall WER achieved is about 26.53%, attained via cluster-weighted training or multi-task learning with a first-token severity classifier, with severity-based strategies offering the most consistent gains for severe impairment. These results highlight practical, impairment-informed strategies to enhance the accessibility of speech technology for people with Parkinson's disease, supporting more inclusive ASR applications.

Abstract

This paper enhances dysarthric and dysphonic speech recognition by fine-tuning pretrained automatic speech recognition (ASR) models on the 2023-10-05 data package of the Speech Accessibility Project (SAP), which contains the speech of 253 people with Parkinson's disease. Experiments tested methods that have been effective for cerebral palsy, including speaker clustering, severity-dependent models, weighted fine-tuning, and multi-task learning. Best results were obtained using a multi-task learning model, in which the ASR is trained to produce an estimate of the speaker's impairment severity as an auxiliary output. The resulting word error rates are considerably lower than those of a baseline model fine-tuned using only LibriSpeech data, with relative word error rate improvements of 37.62% and 26.97% compared to fine-tuning on 100h and 960h of LibriSpeech data, respectively.
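
The percentages above are easier to interpret with the metric written out. The following is a minimal sketch, assuming the standard definition of word error rate (word-level edit distance divided by the number of reference words) and assuming that "improvement" means relative reduction against the baseline WER. The function names `word_error_rate` and `relative_improvement` are illustrative, the example sentences and numbers are made up, and the paper's actual scoring toolkit and text normalization are not stated in this summary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new model."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical transcripts, not taken from the SAP data.
    wer = word_error_rate("the quick brown fox jumps", "the quick brown box")
    print(f"WER: {wer:.2f}")
    # Hypothetical WERs (in percent), not results from the paper.
    print(f"Relative improvement: {relative_improvement(40.0, 30.0):.2%}")
```

Under this definition, a relative improvement is measured against the baseline's own error rate, so the same absolute WER yields a larger relative figure against a weaker baseline. That is consistent with the 100h LibriSpeech comparison showing a larger improvement than the 960h one.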

Paper Structure

This paper contains 11 sections, 2 equations, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: t-SNE plots of x-vectors with K-means clustering (K=2), using the SAP-1005 training split. Top: speaker level; Bottom: utterance level.
  • Figure 2: Relative improvements per severity level of the best-performing models fine-tuned with 1, 2, 3, and 4 severity classes of the SAP-1005 corpus, compared to the model fine-tuned on 960 hours of LibriSpeech.