Dual Data Scaling for Robust Two-Stage User-Defined Keyword Spotting

Zhiqi Ai, Han Cheng, Yuxin Wang, Shiyi Mu, Shugong Xu, Yongjin Zhou

TL;DR

DS-KWS is a two-stage framework for user-defined keyword spotting. The first stage pairs a CTC-based branch with a streaming phoneme search to locate candidate segments; the second stage uses a query-by-text (QbyT) phoneme matcher to verify each candidate at both the phoneme and utterance levels. A dual data scaling strategy expands the ASR training corpus from 460 to 1,460 hours and trains the phoneme matcher on more than 155k anchor classes. On LibriPhrase-Hard the model reaches 6.13% EER and 97.85% AUC, and on Hey-Snips it reaches 99.13% recall at one false alarm per hour in a zero-shot setting, comparable to full-shot trained models. Keyword registration uses a lightweight phoneme module built on nn.Embedding, which reduces offline overhead, and performance remains robust even when the encoder is frozen. The gains come mainly from better discrimination among confusable words and from zero-shot generalization to unseen wake words.

Abstract

In this paper, we propose DS-KWS, a two-stage framework for robust user-defined keyword spotting. It combines a CTC-based method with a streaming phoneme search module to locate candidate segments, followed by a QbyT-based method with a phoneme matcher module for verification at both the phoneme and utterance levels. To further improve performance, we introduce a dual data scaling strategy: (1) expanding the ASR corpus from 460 to 1,460 hours to strengthen the acoustic model; and (2) leveraging over 155k anchor classes to train the phoneme matcher, significantly enhancing the distinction of confusable words. Experiments on LibriPhrase show that DS-KWS significantly outperforms existing methods, achieving 6.13% EER and 97.85% AUC on the Hard subset. On Hey-Snips, it achieves zero-shot performance comparable to full-shot trained models, reaching 99.13% recall at one false alarm per hour.
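
The abstract reports results as equal error rate (EER), the operating point where the false-accept rate equals the false-reject rate. As a reading aid, the following minimal sketch shows one common way to estimate EER from two lists of detection scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the threshold sweep over observed scores, and the "accept when score >= threshold" convention are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def equal_error_rate(
    target_scores: Sequence[float],
    nontarget_scores: Sequence[float],
) -> Tuple[float, float]:
    """Estimate EER by sweeping every observed score as a threshold.

    Hypothetical helper for illustration. A trial is accepted when its
    score is >= the threshold. Returns (eer, threshold) at the point
    where false-accept and false-reject rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None  # (gap, eer, threshold)
    for t in thresholds:
        # False-accept rate: non-target trials that pass the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # False-reject rate: target trials that fail the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


if __name__ == "__main__":
    positives = [0.9, 0.8, 0.7, 0.4]   # scores for true keyword trials
    negatives = [0.6, 0.3, 0.2, 0.1]   # scores for non-keyword trials
    eer, thr = equal_error_rate(positives, negatives)
    print(f"EER = {eer:.2%} at threshold {thr}")
```

With finite score lists the two error rates rarely cross exactly, so the sketch averages them at the closest point; evaluation toolkits often interpolate along the ROC curve instead, which can give slightly different values.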

Paper Structure

This paper contains 12 sections, 6 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Overall architecture of the DS-KWS model. The CTC branch extracts phoneme sequences, and the phoneme search module outputs score $S_1$ and candidate segments. The phoneme matcher produces the second-stage score $S_2$.
  • Figure 2: Implementation of the Phoneme Matcher Module.
  • Figure 3: Comparison of score distributions for DS-KWS-M0 and -M2.
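
The Figure 1 caption describes a first-stage score $S_1$ with candidate segments and a second-stage score $S_2$ from the phoneme matcher. The sketch below illustrates one plausible way such a cascade could make a decision: stage one gates candidates by $S_1$, and only the survivors are scored by stage two. The captions do not say how the two scores are combined, so the gating rule, the thresholds `thr1` and `thr2`, and the names `Candidate`, `verify`, and `two_stage_decision` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass
class Candidate:
    """A candidate segment proposed by the first stage (hypothetical type)."""
    start: int   # first frame of the segment
    end: int     # one past the last frame of the segment
    s1: float    # first-stage score, S_1 in the Figure 1 caption


def two_stage_decision(
    candidates: Iterable[Candidate],
    verify: Callable[[Candidate], float],
    thr1: float,
    thr2: float,
) -> Optional[Tuple[Candidate, float]]:
    """Return the first candidate that passes both stages, else None.

    `verify` stands in for the phoneme matcher and returns S_2.
    The thresholds are assumed values, not taken from the paper.
    """
    for cand in candidates:
        if cand.s1 < thr1:
            continue  # rejected by stage one, so stage two never runs
        s2 = verify(cand)
        if s2 >= thr2:
            return cand, s2
    return None


if __name__ == "__main__":
    segments = [
        Candidate(0, 10, 0.2),    # fails stage one
        Candidate(10, 30, 0.8),   # passes stage one, fails stage two
        Candidate(30, 50, 0.9),   # passes both stages
    ]
    # Toy stand-in for the matcher: a fixed score per segment.
    toy_matcher = lambda c: 0.4 if c.start == 10 else 0.95
    print(two_stage_decision(segments, toy_matcher, thr1=0.5, thr2=0.6))
```

Gating on $S_1$ first keeps the more expensive verifier off most of the audio stream, which is the usual motivation for a cascade; a weighted fusion of $S_1$ and $S_2$ would be an equally valid reading of the caption.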