TDT-KWS: Fast And Accurate Keyword Spotting Using Token-and-duration Transducer

Yu Xi; Hao Li; Baochen Yang; Haoyu Li; Hainan Xu; Kai Yu

TL;DR

TDT-KWS is proposed, which leverages token-and-duration Transducers (TDT) for KWS tasks and achieves on-par or better wake word detection performance than both RNN-T and traditional TDT-ASR systems while achieving significant inference speed-up.

Abstract

Designing an efficient keyword spotting (KWS) system that delivers exceptional performance on resource-constrained edge devices has long been a subject of significant attention. Existing KWS search algorithms typically follow a frame-synchronous approach, where search decisions are made repeatedly at each frame despite the fact that most frames are keyword-irrelevant. In this paper, we propose TDT-KWS, which leverages token-and-duration Transducers (TDT) for KWS tasks. We also propose a novel KWS task-specific decoding algorithm for Transducer-based models, which supports highly effective frame-asynchronous keyword search in streaming speech scenarios. With evaluations conducted on both the public Hey Snips and self-constructed LibriKWS-20 datasets, our proposed KWS-decoding algorithm produces more accurate results than conventional ASR decoding algorithms. Additionally, TDT-KWS achieves on-par or better wake word detection performance than both RNN-T and traditional TDT-ASR systems while achieving significant inference speed-up. Furthermore, experiments show that TDT-KWS is more robust to noisy environments compared to RNN-T KWS.

Paper Structure (14 sections, 4 equations, 3 figures, 3 tables, 1 algorithm)

Introduction
TDT Based Keyword Spotting
Transducers
Token-and-Duration Transducers
Efficient Streaming KWS-Decoding Algorithm
Experimental Setup
Datasets
Experimental Setup
Evaluation Metrics
Results and Analysis
Decoding Algorithm Comparison: ASR vs. KWS-specific
Model Performance: TDT vs. RNN-T
Noise Robustness
Conclusions

Figures (3)

Figure 1: Decoding path for the RNN-T KWS system. Each node $(t, u)$ represents the highest score obtained by outputting the first $u$ elements of the keyword up to time $t$. The horizontal arrow originating from node $(t, u)$ indicates the probability $\phi(t, u)$ of outputting blank. The vertical arrow represents the probability $y(t, u)$ of outputting the $(u+1)$-th element of the keyword at time $t$. The path with the maximum score, which is the optimal path for the keyword at time $t$, is illustrated by red arrows. This path corresponds to the most probable sequence of the keyword at time $t$.
Figure 2: Heatmaps of the wake-up score at each $(t, u)$. The utterance is picked from the test-clean dataset, and the keyword is "everything". The vertical yellow dashed lines mark the boundaries derived from forced alignment. Please zoom in to view the details.
Figure 3: Recall and inference speed comparison between RNN-T KWS and TDT-KWS at different SNRs. SNR=+inf means no noise is added.
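
The lattice described in the Figure 1 caption can be illustrated with a small dynamic-programming sketch. This is not the paper's algorithm, only one plausible reading of the caption. The function name `best_keyword_path_scores`, the input layout, the use of log-probabilities, and the choice to let the keyword start at any frame (score 0 at $u = 0$) are all assumptions made for illustration.

```python
import math


def best_keyword_path_scores(log_blank, log_emit):
    """Max-score path through a (t, u) lattice, as sketched in Figure 1.

    log_blank[t][u]: log phi(t, u), blank at frame t after u keyword tokens
                     (u = 0..U).
    log_emit[t][u]:  log y(t, u), emitting the (u+1)-th keyword token at
                     frame t (u = 0..U-1).
    Returns score[t][u], the best log score of having output the first u
    keyword tokens by frame t.
    """
    num_frames = len(log_blank)
    num_tokens = len(log_emit[0])
    neg_inf = -math.inf
    score = [[neg_inf] * (num_tokens + 1) for _ in range(num_frames)]

    for t in range(num_frames):
        # Assumption of this sketch: the keyword may start at any frame,
        # so the empty prefix always has log score 0.
        score[t][0] = 0.0
        for u in range(1, num_tokens + 1):
            # Vertical arrow: emit the u-th token at frame t from (t, u-1).
            via_emit = score[t][u - 1] + log_emit[t][u - 1]
            # Horizontal arrow: blank at frame t-1 moves (t-1, u) to (t, u).
            via_blank = neg_inf
            if t > 0:
                via_blank = score[t - 1][u] + log_blank[t - 1][u]
            score[t][u] = max(via_emit, via_blank)
    return score


# Toy example: a one-token keyword over two frames.
log_blank = [[math.log(0.9), math.log(0.5)],
             [math.log(0.2), math.log(0.5)]]
log_emit = [[math.log(0.1)],
            [math.log(0.8)]]
scores = best_keyword_path_scores(log_blank, log_emit)
```

Under these assumptions, the last column `scores[t][-1]` is the full-keyword score at frame `t`, which a detector could compare against a threshold. Note that this sketch visits every frame, so it is frame-synchronous. According to the abstract, TDT-KWS instead searches frame-asynchronously.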
