Multichannel Keyword Spotting for Noisy Conditions

Dzmitry Saladukha, Ivan Koriabkin, Kanstantsin Artsiom, Aliaksei Rak, Nikita Ryzhikov

TL;DR

This work tackles reliable on-device keyword spotting under noisy conditions by proposing a multichannel neural architecture that uses an attention mechanism to dynamically weight and fuse inputs from multiple microphones, including an adaptive noise cancellation channel. The authors integrate a multichannel noise reduction strategy with attention-based channel fusion, and evaluate on both controlled Acoustic Laboratory data and real-world smart speaker data, showing that Attention KWS + ANC achieves the best false rejection rate at a fixed false alarm rate while being CPU- and memory-efficient for on-device deployment. Key contributions include a baseline SVDF-like single-channel model, a channel-augmented ensemble, and a compact attention module (~50k parameters) that significantly improves performance over beamformers and ensemble methods. The proposed approach offers practical impact for robust wake-word activation on consumer devices and suggests future directions for selective channel transmission to cloud-based ASR systems to balance accuracy and bandwidth.

Abstract

This article presents a method for improving a keyword spotter (KWS) algorithm in noisy environments. Although beamforming (BF) and adaptive noise cancellation (ANC) techniques are robust in some conditions, they may degrade the performance of the activation system by distorting or suppressing useful signals. The authors propose a neural network architecture that takes several input channels and uses an attention mechanism to determine the most useful channel or combination of channels. The improvement was demonstrated on two datasets: one recorded in a laboratory under controlled conditions and one collected from smart speakers in natural conditions. The proposed algorithm was compared against several baselines and existing solutions in terms of noise reduction metrics, KWS metrics, and computing resources.
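To make the idea of attention-based channel selection concrete, here is a minimal sketch of softmax-weighted channel fusion. It is not the authors' implementation: the function names `softmax` and `fuse_channels`, the feature vectors, and the per-channel relevance scores are all hypothetical, and the scores that an attention network would compute in a real model are simply passed in as inputs.

```python
import math


def softmax(scores):
    """Turn raw per-channel scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_channels(features, scores):
    """Fuse per-channel feature vectors into a single vector.

    features: list of equal-length feature vectors, one per channel
              (e.g. raw microphones plus an ANC output; assumed layout).
    scores:   one relevance score per channel (assumed to come from
              an attention network in a real model).
    """
    weights = softmax(scores)
    dim = len(features[0])
    fused = [
        sum(w * channel[i] for w, channel in zip(weights, features))
        for i in range(dim)
    ]
    return weights, fused


# Hypothetical example: three channels, two features each.
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
scores = [2.0, 0.0, 0.0]  # the first channel is judged most useful
weights, fused = fuse_channels(features, scores)
```

Because the weights are a soft distribution rather than a hard choice, a fusion step of this kind can rely on a single clean channel or blend several, which matches the "most useful channel or combination of channels" behaviour described above.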

Paper Structure

This paper contains 26 sections, 2 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Scheme of ensemble with multiple spotter models for multichannel input
  • Figure 2: The architecture of spotter model extended with attention network for multichannel input
  • Figure 3: SNR versus spotter activation rate plot for laboratory data