AdaKWS: Towards Robust Keyword Spotting with Test-Time Adaptation

Yang Xiao, Tianyi Peng, Yanghao Zhou, Rohan Kumar Das

TL;DR

AdaKWS tackles robust spoken keyword spotting under unseen and noisy conditions through test-time adaptation (TTA). It combines selective entropy minimization, which adapts only on reliable samples, with resampling based on pseudo-keyword consistency (PKC), which preserves stable, feature-relevant signals; a weighted loss guides the adaptation. On Google Speech Commands, the method is more robust than strong TTA baselines under both Gaussian and real-world noise, and ablations show that the entropy and PKC components each contribute. This lets on-device KWS systems adapt to new acoustic environments without access to training data, with potential impact on privacy-preserving, edge-based speech interfaces.

Abstract

Spoken keyword spotting (KWS) aims to identify keywords in audio for a wide range of applications, especially on edge devices. Current small-footprint KWS systems focus on efficient model designs. However, their inference performance can decline in unseen environments or noisy backgrounds. Test-time adaptation (TTA) helps models adapt to test samples without needing the original training data. In this study, we present AdaKWS, which is, to the best of our knowledge, the first TTA method for robust KWS. Specifically, 1) we first optimize the model's confidence by selecting reliable samples based on prediction entropy minimization and adjusting the normalization statistics in each batch; 2) we introduce pseudo-keyword consistency (PKC) to identify critical, reliable features without overfitting to noise. Our experiments show that AdaKWS outperforms other methods across various conditions, including Gaussian noise and real-scenario noises. The code will be released in due course.
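
The entropy-based selection of reliable samples described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the example logits, and in particular the threshold (a fixed fraction `ratio` of the maximum entropy ln C) are assumptions made for illustration, since the abstract does not state how the threshold is chosen.

```python
import numpy as np


def softmax(logits):
    # Subtract the row-wise maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def prediction_entropy(logits):
    # Shannon entropy (in nats) of each sample's predicted distribution.
    probs = softmax(logits)
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)


def select_reliable(logits, ratio=0.4):
    # Hypothetical rule: keep samples whose entropy is below
    # ratio * ln(C), where C is the number of keyword classes.
    num_classes = logits.shape[1]
    threshold = ratio * np.log(num_classes)
    entropy = prediction_entropy(logits)
    return entropy < threshold, entropy


# Made-up logits for a batch of three utterances and four keyword classes.
logits = np.array([
    [8.0, 0.0, 0.0, 0.0],  # confident prediction
    [0.1, 0.0, 0.1, 0.0],  # nearly uniform prediction
    [0.0, 6.0, 0.5, 0.0],  # confident prediction
])
mask, entropy = select_reliable(logits)
```

Under these assumptions, only the samples flagged in `mask` would contribute to the entropy-minimization update; the nearly uniform second sample is excluded because its prediction is too uncertain to serve as a learning signal.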

Paper Structure

This paper contains 16 sections, 6 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of the proposed AdaKWS method, including entropy- and PKC-based selection. The blue box highlights the subsets identified by these two approaches, and Cluster 4 corresponds to $x_{pkc}$ in Eq. (6).
  • Figure 2: Robustness analysis of BC-ResNet-3 on the GSC dataset under (1) Gaussian noise, (2) Babble and Typing noise from the MS-SNSD dataset, and (3) Animals and Natural noise from the ESC-50 dataset. "Source" denotes performance on the clean GSC dataset, and ACC stands for accuracy.
  • Figure 3: Comparative study across test batch sizes for the AdaKWS method. All experiments use 'Domestic' environments (-10 dB) from the ESC-50 noise data.