Speech Recognition-based Feature Extraction for Enhanced Automatic Severity Classification in Dysarthric Speech

Yerin Choi, Jeehyun Lee, Myoung-Wan Koo

TL;DR

This paper tackles automatic severity evaluation of dysarthric speech, aiming to improve both explainability and predictive performance. It introduces speech recognition-based features by fine-tuning a dysarthric ASR (DysarthricWhisper) to transcribe speech and extract word boundaries, organizing features into Pronunciation Correctness and Structural Prosody. The proposed SR-features achieve a balanced accuracy of $83.72\%$, outperforming both waveform-based and DNN baselines while preserving interpretability. The work demonstrates clinically meaningful explanations for dysarthria severity and provides publicly available code to facilitate reproducibility and adoption in practice.
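Balanced accuracy is commonly defined as the mean of per-class recall, so each severity level counts equally regardless of how many speakers fall into it. The sketch below implements that standard definition; whether the paper computes it in exactly this way is an assumption. The labels are made-up toy values, not data from the paper, and the function name is only illustrative.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return the mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)  # correctly predicted samples per true class
    total = defaultdict(int)    # number of samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy severity labels, invented for illustration only.
y_true = ["mild", "mild", "mild", "mild", "severe", "severe"]
y_pred = ["mild", "mild", "mild", "severe", "severe", "mild"]

print(balanced_accuracy(y_true, y_pred))
```

On this toy input, plain accuracy is 4/6, while balanced accuracy is (0.75 + 0.5) / 2 = 0.625, because the smaller class carries the same weight as the larger one.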

Abstract

Due to the subjective nature of current clinical evaluation, the need for automatic severity evaluation in dysarthric speech has emerged. DNN models outperform ML models but lack user-friendly explainability. ML models offer explainable results at a feature level, but their performance is comparatively lower. Current ML models extract various features from raw waveforms to predict severity. However, existing methods do not encompass all dysarthric features used in clinical evaluation. To address this gap, we propose a feature extraction method that minimizes information loss. We introduce ASR transcription as a novel feature extraction source. We fine-tune an ASR model for dysarthric speech, then use this model to transcribe dysarthric speech and extract word segment boundary information. This enables capturing finer pronunciation features and broader prosodic features. These features improved severity prediction performance over existing features, reaching a balanced accuracy of 83.72%.
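To make the idea of transcript-derived features concrete, the sketch below computes two kinds of quantity from an ASR output: a character error rate against the prompted text, as a proxy for pronunciation correctness, and simple timing statistics from word boundaries, as a proxy for prosody. These specific features, the function names, and the `(word, start_sec, end_sec)` input format are assumptions for illustration; they are not the paper's actual feature set or code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(reference, hypothesis):
    """Character error rate of the ASR hypothesis, ignoring spaces."""
    ref = reference.replace(" ", "")
    hyp = hypothesis.replace(" ", "")
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


def prosody_from_boundaries(words):
    """Timing statistics from a time-ordered list of (word, start_sec, end_sec)."""
    if not words:
        raise ValueError("words must not be empty")
    total = words[-1][2] - words[0][1]
    if total <= 0:
        raise ValueError("utterance duration must be positive")
    # Silence between the end of one word and the start of the next.
    pauses = [max(0.0, nxt[1] - cur[2]) for cur, nxt in zip(words, words[1:])]
    return {
        "words_per_second": len(words) / total,
        "pause_ratio": sum(pauses) / total,
        "max_pause": max(pauses, default=0.0),
    }


# Hypothetical prompt, ASR transcript, and word boundaries (invented values).
reference = "the cat sat"
hypothesis = "the cap sa"
boundaries = [("the", 0.0, 0.5), ("cap", 1.0, 1.5), ("sa", 1.5, 2.0)]

features = {"cer": char_error_rate(reference, hypothesis)}
features.update(prosody_from_boundaries(boundaries))
print(features)
```

Features of this kind stay interpretable because each value maps to something a clinician can inspect directly, such as which characters changed or where the long pauses fall.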

Paper Structure

This paper contains 13 sections, 2 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: Top 5 most important features of the proposed model tested on the test set.
  • Figure 2: Examples of result explanation. The top panel compares speech signals and fundamental frequencies between a healthy speaker and a Parkinson's disease patient. The bottom panel is an ASR-based illustration that highlights character-level changes in speech: its upper part shows a healthy speaker's reading, and its lower part shows the ASR model's inferred output for a patient.