AdaKWS: Towards Robust Keyword Spotting with Test-Time Adaptation
Yang Xiao, Tianyi Peng, Yanghao Zhou, Rohan Kumar Das
TL;DR
AdaKWS addresses robust spoken keyword spotting under unseen and noisy conditions through test-time adaptation. It combines selective entropy minimization, which adapts only on reliable samples, with resampling based on pseudo-keyword consistency (PKC), which preserves stable, feature-relevant signals, with adaptation guided by a weighted loss. On Google Speech Commands, the method is more robust than strong TTA baselines under both Gaussian and real-world noise, and ablations show that the entropy and PKC components each contribute. This work lets on-device KWS systems adapt to new acoustic environments without access to training data, with potential impact on privacy-preserving, edge-based speech interfaces.
Abstract
Spoken keyword spotting (KWS) aims to identify keywords in audio and has a wide range of applications, especially on edge devices. Current small-footprint KWS systems focus on efficient model designs. However, their inference performance can decline in unseen environments or noisy backgrounds. Test-time adaptation (TTA) helps models adapt to test samples without needing the original training data. In this study, we present AdaKWS, which is, to the best of our knowledge, the first TTA method for robust KWS. Specifically, (1) we first optimize the model's confidence by selecting reliable samples through prediction entropy minimization and adjusting the normalization statistics in each batch; (2) we introduce pseudo-keyword consistency (PKC) to identify critical, reliable features without overfitting to noise. Our experiments show that AdaKWS outperforms other methods across various conditions, including Gaussian noise and real-world noise. The code will be released in due course.
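To make the first step more concrete, the following is a minimal sketch of entropy-based sample selection, not the authors' implementation. It assumes that a sample counts as "reliable" when the entropy of its predicted class distribution falls below a fraction of the maximum possible entropy ln(C); the names `select_reliable`, `selective_entropy_loss`, and `threshold_ratio`, as well as the value 0.4, are hypothetical choices for illustration. The normalization-statistics update and the PKC step are not covered here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one sample's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_reliable(batch_logits, threshold_ratio=0.4):
    """Return (indices of low-entropy samples, per-sample entropies).

    Assumption: a sample is reliable if its entropy is below
    threshold_ratio * ln(num_classes). The ratio is illustrative only.
    """
    num_classes = len(batch_logits[0])
    threshold = threshold_ratio * math.log(num_classes)
    entropies = [entropy(softmax(logits)) for logits in batch_logits]
    keep = [i for i, h in enumerate(entropies) if h < threshold]
    return keep, entropies


def selective_entropy_loss(batch_logits, threshold_ratio=0.4):
    """Mean entropy over the reliable samples only (0.0 if none qualify)."""
    keep, entropies = select_reliable(batch_logits, threshold_ratio)
    if not keep:
        return 0.0
    return sum(entropies[i] for i in keep) / len(keep)
```

In a real adaptation loop, a loss of this kind would be minimized with respect to a small set of model parameters on each unlabeled test batch; confident predictions drive the update, while near-uniform predictions (likely dominated by noise) are skipped.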
