TDT-KWS: Fast And Accurate Keyword Spotting Using Token-and-duration Transducer

Yu Xi, Hao Li, Baochen Yang, Haoyu Li, Hainan Xu, Kai Yu

TL;DR

TDT-KWS applies token-and-duration Transducers (TDT) to KWS tasks. It achieves on-par or better wake word detection performance than both RNN-T and traditional TDT-ASR systems, with a significant inference speed-up.

Abstract

Designing an efficient keyword spotting (KWS) system that delivers exceptional performance on resource-constrained edge devices has long been a subject of significant attention. Existing KWS search algorithms typically follow a frame-synchronous approach, where search decisions are made repeatedly at each frame despite the fact that most frames are keyword-irrelevant. In this paper, we propose TDT-KWS, which leverages token-and-duration Transducers (TDT) for KWS tasks. We also propose a novel KWS task-specific decoding algorithm for Transducer-based models, which supports highly effective frame-asynchronous keyword search in streaming speech scenarios. With evaluations conducted on both the public Hey Snips and self-constructed LibriKWS-20 datasets, our proposed KWS-decoding algorithm produces more accurate results than conventional ASR decoding algorithms. Additionally, TDT-KWS achieves on-par or better wake word detection performance than both RNN-T and traditional TDT-ASR systems while achieving significant inference speed-up. Furthermore, experiments show that TDT-KWS is more robust to noisy environments compared to RNN-T KWS.

Paper Structure (14 sections, 4 equations, 3 figures, 3 tables, 1 algorithm)

Figures (3)

  • Figure 1: Decoding path for the RNN-T KWS system. Each node $(t, u)$ represents the highest score obtained by outputting the first $u$ elements of the keyword up to time $t$. The horizontal arrow originating from node $(t, u)$ indicates the probability $\phi(t, u)$ of outputting blank. The vertical arrow represents the probability $y(t, u)$ of outputting the $(u+1)$-th element of the keyword at time $t$. Red arrows mark the maximum-score path, which corresponds to the most probable sequence of the keyword at time $t$.
  • Figure 2: Heatmaps of the wake-up score at each $(t, u)$. The utterance is picked from the test-clean dataset, and the keyword is "everything". The vertical yellow dashed lines mark the boundaries derived from forced alignment. Please zoom in to view the details.
  • Figure 3: Recall and inference speed comparison between RNN-T KWS and TDT-KWS at different signal-to-noise ratios (SNRs). SNR=+inf means no noise is added.
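
The lattice described in the Figure 1 caption can be sketched as a max-score dynamic program. This is a minimal illustration under stated assumptions, not the paper's decoding algorithm: scores are kept in the log domain, the path is forced to start at node $(0, 0)$, and the names `best_keyword_scores`, `log_blank`, and `log_token` are hypothetical. A streaming detector would likely also need to let the keyword start at any frame, which this sketch leaves out.

```python
import math

NEG_INF = float("-inf")


def best_keyword_scores(log_blank, log_token):
    """Max-score pass over the (t, u) lattice from the Figure 1 caption.

    log_blank[t][u]: log phi(t, u), blank at node (t, u); moves to (t + 1, u).
    log_token[t][u]: log y(t, u), emits keyword element u + 1 at node (t, u);
                     moves to (t, u + 1).
    Shapes (assumed): log_blank is T x (U + 1), log_token is T x U.
    Returns score[t][u], the best log-score of having output the first
    u keyword elements by time t.
    """
    num_frames = len(log_blank)
    num_elements = len(log_token[0])
    score = [[NEG_INF] * (num_elements + 1) for _ in range(num_frames)]
    score[0][0] = 0.0  # assumption: the path starts at node (0, 0)

    for t in range(num_frames):
        for u in range(num_elements + 1):
            if t > 0:
                # Horizontal arrow: blank emitted at (t - 1, u).
                via_blank = score[t - 1][u] + log_blank[t - 1][u]
                score[t][u] = max(score[t][u], via_blank)
            if u > 0:
                # Vertical arrow: keyword element u emitted at (t, u - 1).
                via_token = score[t][u - 1] + log_token[t][u - 1]
                score[t][u] = max(score[t][u], via_token)
    return score


# Toy lattice: 2 frames, a 1-element keyword (made-up probabilities).
toy_blank = [[math.log(0.5), math.log(0.9)],
             [math.log(0.2), math.log(0.9)]]
toy_token = [[math.log(0.5)],
             [math.log(0.8)]]
toy_scores = best_keyword_scores(toy_blank, toy_token)
```

Here `toy_scores[t][-1]` is the score of the complete keyword at frame `t`. Comparing it against a threshold at every frame would be one way to turn the lattice into a wake-up decision, though the threshold and any score normalization are outside what the caption specifies.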