Frequency & Channel Attention Network for Small Footprint Noisy Spoken Keyword Spotting

Yuanxi Lin, Yuriy Evgenyevich Gapanyuk

TL;DR

The paper tackles robust keyword spotting on resource-constrained devices with FCA-Net, a network that combines ConvMixer-inspired feature interaction, an efficient two-dimensional convolution-based attention module (C2D), and a curriculum-based multi-condition training regime. It compares three lightweight attention mechanisms (SE, ECA, and C2D) and reports that C2D provides fine-grained channel-frequency weighting with fewer parameters, yielding better noise robustness. Ablations show that placing C2D attention in all ConvMixer blocks gives the best performance, and the model keeps a smaller footprint than many transformer-based models. On Google Speech Commands V2 with MUSAN noise, FCA-Net outperforms state-of-the-art small-footprint models, including in challenging noisy conditions, which makes it well suited to real-world edge deployment.
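To make the SE and ECA baselines concrete, the sketch below implements both in plain Python following their commonly published formulations: global average pooling (GAP) per channel, then either two small fully connected layers (SE) or a short 1-D convolution across neighbouring channels (ECA), then a sigmoid. It is an illustration only, not the authors' code. The tensor sizes, the reduction ratio `r`, the kernel size `k`, and the random weights are all assumptions chosen for the demo. Both mechanisms yield one weight per channel. The C2D module, which this summary describes as weighting channel and frequency jointly, is not sketched because its layer layout is not given here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def global_average_pool(feature_map):
    """Reduce a [channels][freq][time] map to one mean value per channel."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def se_weights(pooled, w_reduce, w_expand):
    """SE: GAP -> FC (C -> C/r) -> ReLU -> FC (C/r -> C) -> sigmoid (biases omitted)."""
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_reduce]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_expand]


def eca_weights(pooled, kernel):
    """ECA: GAP -> 1-D convolution over neighbouring channels -> sigmoid.

    Assumes an odd kernel size; the channel vector is zero-padded at both ends.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    return [
        sigmoid(sum(k * padded[i + j] for j, k in enumerate(kernel)))
        for i in range(len(pooled))
    ]


def rescale(feature_map, weights):
    """Multiply every value in a channel by that channel's attention weight."""
    return [
        [[v * w for v in row] for row in channel]
        for channel, w in zip(feature_map, weights)
    ]


# Demo with assumed sizes: 8 channels, 4 frequency bins, 5 frames.
rng = random.Random(0)
C, F, T, r, k = 8, 4, 5, 4, 3
x = [[[rng.uniform(-1, 1) for _ in range(T)] for _ in range(F)] for _ in range(C)]
w_reduce = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]
w_expand = [[rng.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]
kernel = [rng.uniform(-1, 1) for _ in range(k)]

pooled = global_average_pool(x)
se = se_weights(pooled, w_reduce, w_expand)
eca = eca_weights(pooled, kernel)
y = rescale(x, eca)

se_params = 2 * C * (C // r)  # two bias-free FC layers
eca_params = k                # one shared 1-D kernel
print(f"SE parameters: {se_params}, ECA parameters: {eca_params}")
```

With these demo sizes SE needs 2·C·(C/r) weights while ECA needs only k, which is why ECA is usually regarded as the lighter of the two.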

Abstract

In this paper, we aim to improve the robustness of Keyword Spotting (KWS) systems in noisy environments while keeping a small memory footprint. We propose a new convolutional neural network (CNN) called FCA-Net, which combines mixer unit-based feature interaction with a two-dimensional convolution-based attention module. First, we introduce and compare lightweight attention methods to enhance noise robustness in CNNs. Then, we propose an attention module that creates fine-grained attention weights to capture channel- and frequency-specific information, boosting the model's ability to handle noisy conditions. Combining the mixer unit-based feature interaction with the attention module further improves performance. Additionally, we use a curriculum-based multi-condition training strategy. Our experiments show that our system outperforms current state-of-the-art solutions for small-footprint KWS in noisy environments, making it reliable for real-world use.
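The abstract names a curriculum-based multi-condition training strategy without spelling out the schedule. As a rough illustration of the general idea, the sketch below mixes noise into speech at a target signal-to-noise ratio (SNR) and draws that SNR from a stage-dependent list that gets harder over time. The `CURRICULUM` table, its SNR values, and the helper names are hypothetical and are not taken from the paper. Only the gain formula, which follows from the definition SNR = 10·log10(P_speech / P_noise), is standard.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB."""
    noise_power = power(noise)
    if noise_power == 0.0:
        raise ValueError("noise segment is silent; cannot scale it to a target SNR")
    gain = math.sqrt(power(speech) / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Hypothetical schedule: each stage keeps the earlier conditions and adds a
# harder one. None means clean speech. These values are assumptions.
CURRICULUM = [
    [None],
    [None, 20],
    [None, 20, 10],
    [None, 20, 10, 0],
]


def sample_condition(stage, rng):
    """Pick a training condition for the given stage (clamped to the last stage)."""
    return rng.choice(CURRICULUM[min(stage, len(CURRICULUM) - 1)])


def make_example(speech, noise, stage, rng):
    """Return (waveform, snr_db) for one training example at this stage."""
    snr_db = sample_condition(stage, rng)
    if snr_db is None:
        return list(speech), None
    return mix_at_snr(speech, noise, snr_db), snr_db


# Demo: a 440 Hz tone stands in for speech, Gaussian noise for a noise clip.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]

mixed = mix_at_snr(speech, noise, 10)
residual = [m - s for m, s in zip(mixed, speech)]
measured_snr = 10 * math.log10(power(speech) / power(residual))
print(f"requested 10 dB, measured {measured_snr:.2f} dB")
```

In a real pipeline the noise clips would come from a corpus such as MUSAN, and the stage would typically advance once validation accuracy at the current stage stops improving. Both choices are left open here.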

Paper Structure

This paper contains 17 sections, 7 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Overview of ConvMixer model architecture.
  • Figure 2: An illustration of SE and ECA attention. GAP denotes global average pooling.
  • Figure 3: The network structure of the C2D block.