PB-LRDWWS System for the SLT 2024 Low-Resource Dysarthria Wake-Up Word Spotting Challenge

Shiyao Wang; Jiaming Zhou; Shiwan Zhao; Yong Qin

TL;DR

This work tackles wake-up word spotting for low-resource dysarthric speech with PB-LRDWWS, a system that couples a HuBERT-based dysarthric content feature extractor, fine-tuned in three stages, with prototype-based classification. Prototypes are built by averaging features of the enrollment speech, and each test utterance is classified by the cosine similarity between its features and $11$ prototypes (10 keywords plus one non-keyword class). The study systematically compares data augmentation (including TTS-based keyword synthesis and Merge_train), loss functions (CTC, CE, and AddSCL), and classification settings (PB-C, KNN-C, and Model prediction). Cross-entropy loss with PB-C generally yields the best final performance, while CTC benefits from SCL and augmentation but can suffer from class imbalance. PB-LRDWWS placed second on the final Test-B with a score of $0.009801$, showing strong performance in a challenging dysarthric keyword spotting (KWS) setting and offering a practical approach to personalized wake-up word spotting in assistive devices.

Abstract

For the SLT 2024 Low-Resource Dysarthria Wake-Up Word Spotting (LRDWWS) Challenge, we introduce the PB-LRDWWS system. The system combines a dysarthric speech content feature extractor, used for prototype construction, with a prototype-based classification method. The feature extractor is a HuBERT model fine-tuned in three stages with cross-entropy loss. It extracts features from the target dysarthric speaker's enrollment speech to build prototypes. Classification is performed by computing the cosine similarity between the HuBERT features of the target speaker's evaluation speech and the prototypes. Despite its simplicity, experimental results show that the method is effective. Our system achieves second place in the final Test-B of the LRDWWS Challenge.
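The prototype step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each utterance has already been reduced to a single fixed-size feature vector (how the fine-tuned HuBERT outputs are pooled is not stated here), and the function names `build_prototypes` and `classify`, the toy 2-D features, and the labels are hypothetical.

```python
import numpy as np


def build_prototypes(features, labels):
    """Average the enrollment feature vectors of each class into one prototype.

    features: array of shape (n_utterances, dim), assumed to be pre-extracted.
    labels:   sequence of class names, one per utterance.
    """
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))
    prototypes = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, prototypes


def classify(query, classes, prototypes):
    """Return the class whose prototype has the highest cosine similarity."""
    q = query / np.linalg.norm(query)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = p @ q  # cosine similarity to every prototype
    return classes[int(np.argmax(sims))], sims


# Toy enrollment set: two utterances per class, 2-D features for readability.
enroll_feats = np.array([[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8]])
enroll_labels = ["keyword_a", "keyword_a", "non_keyword", "non_keyword"]

classes, prototypes = build_prototypes(enroll_feats, enroll_labels)
predicted, sims = classify(np.array([0.8, 0.2]), classes, prototypes)
print(predicted)
```

In the setting described above there would be 11 prototypes per target speaker (10 keywords plus the non-keyword class) instead of the two used in this toy example.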

Paper Structure (16 sections, 1 equation, 2 figures, 2 tables)

Introduction
Related work
LRDWWS Challenge
Our contribution
System Overview
Training phase: building dysarthric speech content feature extractor
Inference phase: PB-LRDWWS
Implementation Details
Dataset
Experimental setup
Evaluations and results
Building speaker-independent dysarthria models with better generalization ability
Building speaker-dependent dysarthria models and comparing the effects of different classification settings
Final result
Conclusion
...and 1 more sections

Figures (2)

Figure 1: A three-stage fine-tuning process for building dysarthric speech content feature extractor.
Figure 2: PB-LRDWWS.
