Enhancing Voice Wake-Up for Dysarthria: Mandarin Dysarthria Speech Corpus Release and Customized System Design

Ming Gao; Hang Chen; Jun Du; Xin Xu; Hongxiao Guo; Hui Bu; Jianxing Yang; Ming Li; Chin-Hui Lee

TL;DR

The paper tackles inclusive wake-up word spotting (WWS) for dysarthric speakers by releasing the Mandarin Dysarthria Speech Corpus (MDSC) and designing a customized WWS system. The corpus contains 18,630 recordings (17 hours) from 21 dysarthric and 25 control speakers, with intelligibility annotations and enrollment data, and experiments on it show the limitations of conventional systems on dysarthric speech. The proposed framework has three tiers (SIC, SID, SDD) and uses a DS-TCN baseline with augmentation. Enrollment-based speaker customization yields substantial improvements, especially for moderately intelligible users, though highly unintelligible cases remain challenging. The work advances practical accessibility for dysarthric users in smart-home contexts and lays groundwork for language- and speaker-aware WWS research, with potential societal impact in reducing exclusion from voice-controlled technologies.

Abstract

Smart home technology has gained widespread adoption, facilitating effortless control of devices through voice commands. However, individuals with dysarthria, a motor speech disorder, face challenges due to the variability of their speech. This paper addresses the wake-up word spotting (WWS) task for dysarthric individuals, aiming to integrate them into real-world applications. To support this, we release the open-source Mandarin Dysarthria Speech Corpus (MDSC), a dataset designed for dysarthric individuals in home environments. MDSC includes information on age, gender, disease type, and intelligibility evaluations. Furthermore, we perform a comprehensive experimental analysis on MDSC, highlighting the challenges encountered. We also develop a customized dysarthria WWS system that is robust to variation in intelligibility and achieves exceptional performance. MDSC will be released at https://www.aishelltech.com/AISHELL_6B.

Paper Structure (14 sections, 1 equation, 5 figures, 2 tables)

This paper contains 14 sections, 1 equation, 5 figures, 2 tables.

Introduction
MDSC
Statistics
Collection
Intelligibility Evaluation
Description of Dysarthria WWS System
Baseline WWS Models
Speaker-dependent Dysarthria WWS Model
Experiments and Analysis
Metrics
Analysis of the Baseline WWS Models
Analysis of Speaker-dependent Dysarthria WWS Model
Conclusion
Acknowledgements

Figures (5)

Figure 1: The framework of speaker-dependent dysarthria WWS. The upper part displays the specific network details, where the network architecture of the Speaker-independent Control (SIC), Speaker-independent Dysarthria (SID), and Speaker-dependent Dysarthria (SDD) WWS models are identical. The lower part illustrates the relationships among these models and the overall training process. The SIC model is trained using this network architecture on the C-train dataset. Based on the SIC model, the SID model is fine-tuned with the D-train dataset. Furthermore, the SDD model specific to an individual is fine-tuned using their corresponding D-enroll dataset, building upon the foundation of the SID model.
Figure 2: Intelligibility-score relationship for individuals with dysarthria on a conventional WWS system.
Figure 3: Wake-up performance of SIC, SID and SDD models on D1-D6 test sets. A lower score indicates better performance.
Figure 4: (a) The performance for different positive-to-negative ratios of enrollment utterances. The x-axis represents the duration ratio of positive to negative instances, ranging from 1:0 to 1:10. (b) The performance for different durations of enrollment utterances. The x-axis represents the entire duration of enrollment utterances, ranging from 1 minute to 3 minutes.
Figure 5: The wake-up failure cases of individuals with dysarthria. In case (a), the speaker exhibits sudden pauses and breaths within the sentence. In case (b), the speaker experiences a decreased volume in the latter half of the phrase, accompanied by a noticeable prolongation of sounds and instances of speech coarticulation.
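
The training chain described in the Figure 1 caption (SIC trained on C-train, SID fine-tuned from SIC on D-train, one SDD model per speaker fine-tuned from SID on that speaker's D-enroll data) can be sketched as below. This is a minimal illustration, not the authors' implementation: a two-weight logistic model stands in for the actual network, the feature vectors and speaker names are made-up toy data, and `predict` and `fine_tune` are hypothetical helper names.

```python
import math
from copy import deepcopy


def predict(weights, features):
    """Wake-word probability from a logistic model (a stand-in for the real network)."""
    z = weights["bias"] + sum(w * x for w, x in zip(weights["w"], features))
    return 1.0 / (1.0 + math.exp(-z))


def fine_tune(weights, dataset, epochs=200, lr=0.5):
    """Return a copy of `weights` updated by SGD on `dataset`; the input is not modified."""
    tuned = deepcopy(weights)
    for _ in range(epochs):
        for features, label in dataset:
            # Gradient of the log loss with respect to the logit.
            error = predict(tuned, features) - label
            tuned["w"] = [w - lr * error * x for w, x in zip(tuned["w"], features)]
            tuned["bias"] -= lr * error
    return tuned


# Hypothetical toy data: (2-D feature vector, label), 1 = wake-up word, 0 = other speech.
c_train = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]   # control speakers
d_train = [([0.8, 0.3], 1), ([0.2, 0.9], 0)]   # dysarthric speakers
d_enroll = {                                   # per-speaker enrollment utterances
    "speaker_A": [([0.6, 0.5], 1), ([0.3, 0.8], 0)],
    "speaker_B": [([0.9, 0.6], 1), ([0.5, 0.9], 0)],
}

init = {"w": [0.0, 0.0], "bias": 0.0}
sic = fine_tune(init, c_train)   # speaker-independent control model
sid = fine_tune(sic, d_train)    # speaker-independent dysarthria model
# One speaker-dependent model per enrolled speaker, each starting from SID.
sdd = {spk: fine_tune(sid, data) for spk, data in d_enroll.items()}
```

Each stage copies the previous model before updating it, so the SID weights stay available as the shared starting point for every new speaker's SDD model.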
