NTC-KWS: Noise-aware CTC for Robust Keyword Spotting

Yu Xi, Haoyu Li, Hao Li, Jiaqi Guo, Xu Li, Wen Ding, Kai Yu

TL;DR

This work addresses the vulnerability of CTC-based keyword spotting to noise on resource-limited devices. It introduces NTC-KWS, a noise-aware CTC framework that adds two WFST-based wildcard arcs to both training and decoding: self-loop arcs to model insertion errors and bypass arcs to model masking and interference errors. These arcs expand the search space through a new decoding graph, G_ntc. On Hey Snips, NTC-KWS outperforms state-of-the-art end-to-end baselines and standard CTC-KWS across varying acoustic conditions, with particularly large gains at extremely low SNRs (-5 dB and 0 dB). It achieves a 4.9% absolute improvement in average recall over CTC-KWS, which points to its practical potential for robust wake-word detection.

Abstract

In recent years, there has been growing interest in designing small-footprint yet effective Connectionist Temporal Classification-based keyword spotting (CTC-KWS) systems. They are typically deployed on low-resource computing platforms, where limitations on model size and computational capacity create bottlenecks under complicated acoustic scenarios. Such constraints often result in overfitting and confusion between keywords and background noise, leading to a high false-alarm rate. To address these issues, we propose a noise-aware CTC-based KWS (NTC-KWS) framework designed to enhance model robustness in noisy environments, particularly under extremely low signal-to-noise ratios (SNRs). Our approach introduces two additional noise-modeling wildcard arcs into the training and decoding processes based on weighted finite-state transducer (WFST) graphs: self-loop arcs to address noise insertion errors and bypass arcs to handle masking and interference caused by excessive noise. Experiments on clean and noisy Hey Snips show that NTC-KWS outperforms state-of-the-art (SOTA) end-to-end systems and CTC-KWS baselines across various acoustic conditions, with particularly strong performance in low-SNR scenarios.

Paper Structure

This paper contains 14 sections, 6 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Overview of noisy data simulation and a comparison of the CTC and NTC WFST-based decoding graphs. The left part uses a noise simulation example to show three types of errors introduced by overwhelming noise: masking and interference (shown in red) and insertion (shown in blue). The right part gives a toy example in which the keyword is set to "A B C". The decoding transition rules at the grammar and token levels are represented by $\mathcal{G}$ and $\mathcal{S}$, respectively. In NTC, the two types of additional wildcard arcs are highlighted in red and blue, matching the error colors of the simulated example on the left. In the composed graphs $\mathcal{S}$, $\lambda_{1}$ and $\lambda_{2}$ represent the wildcard transition costs. An $\epsilon$ to the left of $:$ denotes an empty transition, while an $\epsilon$ to the right indicates a null output.
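
The toy "A B C" example in the caption can be made concrete with a small sketch. This is not the authors' implementation: it ignores the CTC blank symbol, works on discrete symbols rather than frame-level posteriors, and uses a plain arc list instead of a WFST library. The names `build_keyword_graph`, `best_cost`, `self_loop_cost` and `bypass_cost` are hypothetical. The two cost parameters stand in for the wildcard transition costs; the caption does not say which of $\lambda_{1}$ and $\lambda_{2}$ belongs to which arc type, so the parameters are named by arc type instead.

```python
from typing import List, Optional, Tuple

# (source state, destination state, label, cost); all names are illustrative.
Arc = Tuple[int, int, str, float]
WILDCARD = "*"  # assumed stand-in for a wildcard label that matches any symbol


def build_keyword_graph(
    keyword: List[str],
    self_loop_cost: Optional[float] = None,
    bypass_cost: Optional[float] = None,
) -> List[Arc]:
    """Linear keyword graph, optionally extended with wildcard arcs."""
    arcs: List[Arc] = []
    for i, token in enumerate(keyword):
        arcs.append((i, i + 1, token, 0.0))  # ordinary keyword arc
        if bypass_cost is not None:
            # Bypass arc: advance even if the token was masked or corrupted.
            arcs.append((i, i + 1, WILDCARD, bypass_cost))
    if self_loop_cost is not None:
        for state in range(len(keyword) + 1):
            # Self-loop arc: absorb an inserted noise symbol without advancing.
            arcs.append((state, state, WILDCARD, self_loop_cost))
    return arcs


def best_cost(arcs: List[Arc], num_states: int, observed: List[str]) -> float:
    """Minimum total cost of reaching the final state (inf if impossible)."""
    inf = float("inf")
    cost = [inf] * num_states
    cost[0] = 0.0
    for symbol in observed:
        nxt = [inf] * num_states
        for src, dst, label, weight in arcs:
            if cost[src] < inf and label in (symbol, WILDCARD):
                nxt[dst] = min(nxt[dst], cost[src] + weight)
        cost = nxt
    return cost[num_states - 1]


keyword = ["A", "B", "C"]
plain = build_keyword_graph(keyword)
noise_aware = build_keyword_graph(keyword, self_loop_cost=1.0, bypass_cost=2.0)

inserted = ["A", "x", "B", "C"]  # a noise symbol inserted after "A"
masked = ["A", "x", "C"]         # "B" replaced by noise

print(best_cost(plain, 4, inserted))        # inf: the plain graph rejects it
print(best_cost(noise_aware, 4, inserted))  # 1.0: one self-loop absorbs "x"
print(best_cost(noise_aware, 4, masked))    # 2.0: one bypass arc skips "B"
```

In this sketch the wildcard costs control the trade-off: lower costs let the graph tolerate more noise, but they also make it easier for non-keyword input to reach the final state.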