NTC-KWS: Noise-aware CTC for Robust Keyword Spotting
Yu Xi, Haoyu Li, Hao Li, Jiaqi Guo, Xu Li, Wen Ding, Kai Yu
TL;DR
This work addresses the vulnerability of CTC-based keyword spotting to noise on resource-limited devices. It introduces NTC-KWS, a noise-aware CTC framework that adds two WFST-based wildcard arcs to both training and decoding: self-loop arcs to model noise insertion errors and bypass arcs to model masking errors. Together these arcs expand the search space through a new decoding graph, G_ntc. Experiments on Hey Snips show that NTC-KWS outperforms state-of-the-art end-to-end baselines and standard CTC-KWS across varying acoustic conditions, with particularly large gains at extremely low SNRs (-5 dB and 0 dB). The approach achieves a 4.9% absolute improvement in average recall over CTC-KWS, which points to its practical potential for wake-word detection in noisy environments.
Abstract
In recent years, there has been growing interest in designing small-footprint yet effective Connectionist Temporal Classification-based keyword spotting (CTC-KWS) systems. These systems are typically deployed on low-resource computing platforms, where limits on model size and computational capacity become bottlenecks in complex acoustic scenarios. Such constraints often result in overfitting and in confusion between keywords and background noise, leading to high false-alarm rates. To address these issues, we propose a noise-aware CTC-based KWS (NTC-KWS) framework designed to enhance model robustness in noisy environments, particularly at extremely low signal-to-noise ratios (SNRs). Our approach introduces two additional noise-modeling wildcard arcs into the training and decoding processes, both of which are based on weighted finite state transducer (WFST) graphs: self-loop arcs to address noise insertion errors, and bypass arcs to handle masking and interference caused by excessive noise. Experiments on clean and noisy Hey Snips show that NTC-KWS outperforms state-of-the-art (SOTA) end-to-end systems and CTC-KWS baselines across various acoustic conditions, with particularly strong performance in low-SNR scenarios.
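The role of the two wildcard arcs can be illustrated with a toy graph. The following is a minimal sketch in plain Python, not the authors' implementation: it works on symbol sequences rather than frame-level CTC posteriors, omits the CTC blank, and uses no WFST library. The names `Arc`, `build_keyword_graph`, `best_cost`, the `<*>` label, and the penalty values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

WILDCARD = "<*>"  # hypothetical label: matches any input symbol in this toy


@dataclass(frozen=True)
class Arc:
    src: int
    dst: int
    label: str
    cost: float  # penalty, combined in the tropical (min, +) sense


def build_keyword_graph(tokens, self_loop_cost=None, bypass_cost=None):
    """Linear keyword acceptor, optionally with wildcard arcs.

    self_loop_cost: if set, every state gets a wildcard self-loop
                    (absorbs an inserted symbol without advancing).
    bypass_cost:    if set, every token arc gets a parallel wildcard arc
                    (advances even when the token itself is masked).
    """
    arcs = []
    for i, tok in enumerate(tokens):
        arcs.append(Arc(i, i + 1, tok, 0.0))
        if bypass_cost is not None:
            arcs.append(Arc(i, i + 1, WILDCARD, bypass_cost))
    if self_loop_cost is not None:
        for state in range(len(tokens) + 1):
            arcs.append(Arc(state, state, WILDCARD, self_loop_cost))
    return arcs, len(tokens)  # arcs and the id of the final state


def best_cost(arcs, final_state, symbols):
    """Minimum cost of a path from state 0 to final_state that consumes
    every input symbol; float('inf') if no such path exists."""
    inf = float("inf")
    cost = {0: 0.0}
    for sym in symbols:
        nxt = {}
        for arc in arcs:
            if arc.src in cost and arc.label in (sym, WILDCARD):
                c = cost[arc.src] + arc.cost
                if c < nxt.get(arc.dst, inf):
                    nxt[arc.dst] = c
        cost = nxt
    return cost.get(final_state, inf)


if __name__ == "__main__":
    keyword = ["hey", "snips"]
    plain, final = build_keyword_graph(keyword)
    noisy, _ = build_keyword_graph(keyword, self_loop_cost=1.0, bypass_cost=2.0)

    inserted = ["hey", "noise", "snips"]  # an extra symbol between tokens
    masked = ["garbled", "snips"]         # first token corrupted

    print(best_cost(plain, final, inserted))  # no path in the plain graph
    print(best_cost(noisy, final, inserted))  # absorbed by a self-loop
    print(best_cost(noisy, final, masked))    # skipped via a bypass arc
```

In this sketch the plain graph rejects both corrupted sequences, while the graph with wildcard arcs accepts them at a penalty. In a real system the penalties would trade recall against false alarms, so the values here should be read as placeholders rather than recommended settings.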
