Optimizing Dysarthria Wake-Up Word Spotting: An End-to-End Approach for SLT 2024 LRDWWS Challenge

Shuiyun Liu; Yuxiang Kong; Pengcheng Guo; Weiji Zhuang; Peng Gao; Yujun Wang; Lei Xie

TL;DR

This work tackles low-resource dysarthric wake-up word spotting with PD-DWS, an end-to-end system that combines a pretrained data2vec2-based encoder, fine-tuned in a multi-task setting (ASR and WWS), with a two-stage dual filter that suppresses false accepts. The 2branch-d2v2 model jointly optimizes the ASR and WWS losses, $L = 0.5 \cdot L_{\text{CTC}} + 1.0 \cdot L_{\text{WWS}}$, and is trained with dynamic augmentations. A Threshold Filter and an ASR Filter then refine detections through score thresholding and cross-verification against Paraformer ASR outputs. TTS-based data augmentation with a VITS system supplies synthetic data for fine-tuning Paraformer, which improves robustness to dysarthric speech. On the LRDWWS test-B eval set, PD-DWS achieves a FAR of 0.00321 and an FRR of 0.005 (total score 0.00821), securing first place in the challenge. The approach targets speaker-specific wake-word systems in healthcare and smart-device contexts where data is scarce and speech is highly variable.
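The loss weighting and the two filter stages described above can be illustrated with a minimal Python sketch. Only the weights 0.5 and 1.0 come from the text; the function names, the default threshold of 0.5, the placeholder wake-word strings, and the substring check used for ASR cross-verification are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, Optional

CTC_WEIGHT = 0.5  # weight on the ASR (CTC) loss, as stated in the summary
WWS_WEIGHT = 1.0  # weight on the wake-up word spotting loss


def multitask_loss(ctc_loss: float, wws_loss: float) -> float:
    """Combine the two task losses: L = 0.5 * L_CTC + 1.0 * L_WWS."""
    return CTC_WEIGHT * ctc_loss + WWS_WEIGHT * wws_loss


def dual_filter(
    wws_scores: Dict[str, float],
    asr_transcript: str,
    threshold: float = 0.5,  # hypothetical value; not given in the text
) -> Optional[str]:
    """Return the accepted wake word, or None if either filter rejects it."""
    best_word = max(wws_scores, key=wws_scores.get)

    # Stage 1 (Threshold Filter): reject low-confidence detections.
    if wws_scores[best_word] < threshold:
        return None

    # Stage 2 (ASR Filter): assumed here to be a simple check that the
    # ASR transcript contains the detected wake word.
    if best_word not in asr_transcript:
        return None

    return best_word


# Hypothetical scores and transcript, for illustration only.
scores = {"wake_word_a": 0.92, "wake_word_b": 0.05}
print(dual_filter(scores, "wake_word_a"))  # accepted by both stages
print(dual_filter(scores, "something else"))  # rejected by the ASR stage
```

Running the two stages in sequence means a detection is accepted only when both agree, which is how a second filter can lower the false accept rate without touching utterances the first stage already handles correctly.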

Abstract

Speech has emerged as a widely embraced user interface across diverse applications. However, for individuals with dysarthria, the inherent variability in their speech poses significant challenges. This paper presents an end-to-end Pretrain-based Dual-filter Dysarthria Wake-up word Spotting (PD-DWS) system for the SLT 2024 Low-Resource Dysarthria Wake-Up Word Spotting Challenge. Specifically, our system improves performance from two key perspectives: audio modeling and dual-filter strategy. For audio modeling, we propose an innovative 2branch-d2v2 model based on the pre-trained data2vec2 (d2v2), which can simultaneously model automatic speech recognition (ASR) and wake-up word spotting (WWS) tasks through a unified multi-task finetuning paradigm. Additionally, a dual-filter strategy is introduced to reduce the false accept rate (FAR) while maintaining the same false reject rate (FRR). Experimental results demonstrate that our PD-DWS system achieves an FAR of 0.00321 and an FRR of 0.005, with a total score of 0.00821 on the test-B eval set, securing first place in the challenge.
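The reported figures are consistent with a total score equal to the unweighted sum of FAR and FRR (0.00321 + 0.005 = 0.00821). The sketch below assumes that scoring rule and the usual definitions of the two rates; the helper names and the trial counts in the test are hypothetical.

```python
from typing import Tuple


def error_rates(
    false_accepts: int, negatives: int, false_rejects: int, positives: int
) -> Tuple[float, float]:
    """Return (FAR, FRR) from raw trial counts.

    FAR = false accepts / negative trials
    FRR = false rejects / positive trials
    """
    far = false_accepts / negatives
    frr = false_rejects / positives
    return far, frr


def total_score(far: float, frr: float) -> float:
    """Assumed scoring rule: unweighted sum of FAR and FRR (lower is better)."""
    return far + frr


# Reproduce the total reported in the abstract from its two components.
print(round(total_score(0.00321, 0.005), 5))  # 0.00821
```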

Paper Structure (15 sections, 2 equations, 2 figures, 7 tables, 1 algorithm)

Introduction
PROPOSED SYSTEM
Audio Modeling
Dual-Filter: Threshold Filter
Dual-Filter: ASR Filter
TTS Generator
EXPERIMENT CONFIGURATION
Datasets
Configuration
Evaluation
RESULTS AND ANALYSIS
Comparison with different base model
Comparison with other competition systems
Ablation Study
CONCLUSION

Figures (2)

Figure 1: (a) An overview of our proposed PD-DWS system; (b) Details of the 2branch-d2v2 encoder.
Figure 2: The inference procedure of the VITS system (Kim et al., ICML 2021).
