Multiple-Instance, Cascaded Classification for Keyword Spotting in Narrow-Band Audio

Ahmad AbdulKader, Kareem Nassar, Mohamed El-Geish, Daniel Galvez, Chetan Patil

TL;DR

This work targets real-time keyword spotting in narrow-band (NB), 8 kHz audio under non-IID conditions. It introduces a cascaded DNN system that uses two distinct feature representations (MFCC and PLP) and frames the problem as multiple-instance learning (MIL), which enables early termination and robust handling of hard negatives. The key contributions are the integration of multi-representation features with a three-stage cascade and MIL aggregation, which substantially reduces hourly false positives at modest false-negative rates. The approach offers practical benefits for energy-constrained devices and noisy, real-world environments, and performs competitively against wide-band baselines in the NB setting.

Abstract

We propose using cascaded classifiers for a keyword spotting (KWS) task on narrow-band (NB), 8 kHz audio acquired in non-IID environments, a more challenging task than most state-of-the-art KWS systems face. We present a model that incorporates Deep Neural Networks (DNNs), cascading, multiple-feature representations, and multiple-instance learning. The cascaded classifiers handle the task's class imbalance and reduce power consumption on computationally constrained devices via early termination. The KWS system achieves a false negative rate of 6% at an hourly false positive rate of 0.75.
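
The early-termination idea in the abstract can be illustrated with a minimal sketch. It assumes each cascade stage returns a keyword probability and that a window is rejected as soon as one stage scores below its threshold, so later (presumably costlier) stages never run. The names `cascade_score`, `stages`, and `thresholds`, and the toy stage functions in the usage note, are hypothetical and not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a stage maps a feature vector to a keyword probability in [0, 1].
Stage = Callable[[Sequence[float]], float]


def cascade_score(
    features: Sequence[float],
    stages: List[Stage],
    thresholds: List[float],
) -> Tuple[bool, float, int]:
    """Run the stages in order and stop at the first rejection.

    Returns (accepted, last_score, stages_run). A window is accepted only
    if every stage scores at or above its threshold.
    """
    if not stages or len(stages) != len(thresholds):
        raise ValueError("need one threshold per stage and at least one stage")

    score = 0.0
    for stages_run, (stage, threshold) in enumerate(zip(stages, thresholds), start=1):
        score = stage(features)
        if score < threshold:
            # Early termination: the remaining stages are skipped,
            # which is where the compute (and power) saving comes from.
            return False, score, stages_run
    return True, score, len(stages)
```

As a usage illustration with toy stand-in stages, `cascade_score([0.1, 0.9, 0.9], [lambda x: x[0], lambda x: x[1], lambda x: x[2]], [0.2, 0.5, 0.8])` rejects the window after running only the first stage. Because most audio windows contain no keyword, rejecting them early is what lets a cascade cope with class imbalance cheaply.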

Paper Structure

This paper contains 11 sections and 3 figures.

Figures (3)

  • Figure 1: KWS system diagram: (i) feature extraction, (ii) cascaded classifiers, (iii) noisy-OR.
  • Figure 2: A plot showing the effects of using PLPs, MFCCs, and multi-representation models.
  • Figure 3: A plot showing the effects of using cascaded classifiers.
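
The noisy-OR stage named in Figure 1 is not specified further in this section. The sketch below assumes the standard noisy-OR used in multiple-instance learning, where a bag (here, presumably a set of scored audio windows) is positive if at least one instance is positive: P(bag) = 1 - ∏(1 - p_i). The function name `noisy_or` and the treatment of instances as independent per-window probabilities are assumptions, not details from the paper.

```python
from typing import Iterable


def noisy_or(instance_probs: Iterable[float]) -> float:
    """Combine per-instance probabilities into one bag-level probability.

    Standard noisy-OR: the bag is negative only if every instance is
    negative, so P(bag positive) = 1 - prod(1 - p_i).
    """
    prob_all_negative = 1.0
    for p in instance_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        prob_all_negative *= 1.0 - p
    return 1.0 - prob_all_negative


# Two windows that each score 0.5 yield a bag-level probability of 0.75.
print(noisy_or([0.5, 0.5]))
```

One consequence of this aggregation is that a single confident instance dominates the bag score, while many weak instances can also accumulate into a detection.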