Multiple-Instance, Cascaded Classification for Keyword Spotting in Narrow-Band Audio
Ahmad AbdulKader, Kareem Nassar, Mohamed El-Geish, Daniel Galvez, Chetan Patil
TL;DR
This work targets real-time keyword spotting in narrow-band (NB), 8 kHz audio under non-IID conditions. It introduces a cascaded DNN system that uses two distinct feature representations (MFCC and PLP) and casts the problem as multiple-instance learning (MIL), enabling early termination and robust handling of hard negatives. The key contribution is the combination of multi-representation features with a three-stage cascade and MIL aggregation, which substantially reduces hourly false positives at modest false-negative rates. The approach offers practical benefits for energy-constrained devices and noisy, real-world environments, and in the NB setting it performs competitively against wide-band baselines.
Abstract
We propose using cascaded classifiers for a keyword spotting (KWS) task on narrow-band (NB), 8 kHz audio acquired in non-IID environments, a more challenging task than most state-of-the-art KWS systems face. We present a model that incorporates Deep Neural Networks (DNNs), cascading, multiple-feature representations, and multiple-instance learning. The cascaded classifiers handle the task's class imbalance and reduce power consumption on computationally constrained devices via early termination. The KWS system achieves a false negative rate of 6% at an hourly false positive rate of 0.75.
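To make the interplay between early termination and multiple-instance learning concrete, the following is a minimal sketch, not the authors' implementation. It assumes each cascade stage returns a keyword score in [0, 1], that an instance is rejected as soon as any stage scores below its threshold, and that a bag (for example, the windows of one utterance) is positive if at least one instance survives every stage. The stage functions, thresholds, and names such as `cascade_decision` and `bag_contains_keyword` are hypothetical stand-ins for the paper's DNN stages.

```python
from typing import Callable, List, Sequence, Tuple

# A stage maps one instance (e.g. features of a short audio window)
# to a keyword score in [0, 1]. Real stages would be DNNs; these are toys.
Stage = Callable[[Sequence[float]], float]


def cascade_decision(
    instance: Sequence[float],
    stages: List[Stage],
    thresholds: List[float],
) -> Tuple[bool, int]:
    """Run stages in order and stop at the first one that rejects.

    Returns (accepted, number_of_stages_evaluated). Early rejection is
    what saves computation on the many easy negatives.
    """
    stages_run = 0
    for stage, threshold in zip(stages, thresholds):
        stages_run += 1
        if stage(instance) < threshold:
            return False, stages_run  # early termination
    return True, stages_run


def bag_contains_keyword(
    bag: List[Sequence[float]],
    stages: List[Stage],
    thresholds: List[float],
) -> bool:
    """Assumed MIL rule: a bag is positive if any instance is accepted."""
    return any(cascade_decision(x, stages, thresholds)[0] for x in bag)


# Hypothetical three-stage cascade with toy scoring functions.
toy_stages: List[Stage] = [
    lambda x: sum(x) / len(x),  # stage 1: mean of the features
    lambda x: max(x),           # stage 2: largest feature
    lambda x: min(x),           # stage 3: smallest feature
]
toy_thresholds = [0.3, 0.6, 0.2]

easy_negative = [0.1, 0.1]  # rejected by stage 1
hard_negative = [0.4, 0.5]  # passes stage 1, rejected by stage 2
positive = [0.5, 0.9]       # passes all three stages
```

In this toy setup, `cascade_decision(easy_negative, toy_stages, toy_thresholds)` stops after one stage, while the hard negative needs two and the positive needs all three. The "any instance" rule is the standard MIL assumption and is used here only for illustration; the aggregation actually used in the paper may differ.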
