Dual Data Scaling for Robust Two-Stage User-Defined Keyword Spotting

Zhiqi Ai; Han Cheng; Yuxin Wang; Shiyi Mu; Shugong Xu; Yongjin Zhou

TL;DR

DS-KWS is a two-stage framework for robust user-defined keyword spotting. The first stage uses a CTC-based branch with streaming phoneme search to locate candidate segments; the second stage uses a QbyT-based phoneme matcher to verify them at both the phoneme and utterance levels. A dual data scaling strategy expands the ASR training corpus from $460$ to $1460$ hours and increases the number of anchor classes to over $155000$, yielding strong results on LibriPhrase-Hard ($\text{EER}=6.13\%$, $\text{AUC}=97.85\%$) and competitive zero-shot performance on Hey-Snips ($\text{Recall}=99.13\%$ at one false alarm per hour). The model uses a lightweight phoneme registration module based on nn.Embedding to reduce offline overhead, and it remains robust even when the encoder is frozen. Overall, DS-KWS improves discrimination among confusable words and achieves zero-shot generalization to unseen wake words that approaches full-shot trained models, which is relevant for robust, low-latency wake-word detection on real devices.

Abstract

In this paper, we propose DS-KWS, a two-stage framework for robust user-defined keyword spotting. It combines a CTC-based method with a streaming phoneme search module to locate candidate segments, followed by a QbyT-based method with a phoneme matcher module for verification at both the phoneme and utterance levels. To further improve performance, we introduce a dual data scaling strategy: (1) expanding the ASR corpus from 460 to 1,460 hours to strengthen the acoustic model; and (2) leveraging over 155k anchor classes to train the phoneme matcher, significantly enhancing the distinction of confusable words. Experiments on LibriPhrase show that DS-KWS significantly outperforms existing methods, achieving 6.13% EER and 97.85% AUC on the Hard subset. On Hey-Snips, it achieves zero-shot performance comparable to full-shot trained models, reaching 99.13% recall at one false alarm per hour.
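
For readers unfamiliar with the EER metric reported above, the following is a minimal sketch of how an equal error rate can be computed from verification scores. It is an illustration only, not the authors' evaluation code: the function name `equal_error_rate`, the toy scores, and the simple threshold sweep are all assumptions made for this example.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping every observed score as a threshold.

    scores: match scores, higher means "more likely the enrolled keyword".
    labels: truthy for genuine keyword trials, falsy for impostor trials.
    Hypothetical helper for illustration; not taken from the paper.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(scores)):
        # False acceptance rate: impostor trials scoring at or above t.
        far = sum(s >= t for s in neg) / len(neg)
        # False rejection rate: genuine trials scoring below t.
        frr = sum(s < t for s in pos) / len(pos)
        gap = abs(far - frr)
        if gap < best_gap:
            # EER is taken where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy example with four genuine and four impostor trials.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

A lower EER means that genuine and confusable keywords are separated more cleanly by the verification score, which is the property the Hard subset of LibriPhrase is meant to stress.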
