Global-Local Convolution with Spiking Neural Networks for Energy-efficient Keyword Spotting

Shuai Wang, Dehao Zhang, Kexin Shi, Yuchen Wang, Wenjie Wei, Jibin Wu, Malu Zhang

TL;DR

This work tackles energy-efficient keyword spotting on edge devices by proposing an end-to-end Spiking Neural Network (SNN) model built from Global-Local Spiking Convolution (GLSC) and Bottleneck-PLIF modules. The GLSC module enables sparse, dual-scale feature extraction by combining Conv1d and Dilated Conv1d with spiking dynamics, while the Bottleneck-PLIF module provides a lightweight classifier with learnable decay and channel fusion. Experiments on Google Speech Commands V1 and V2 demonstrate competitive accuracy with a significantly smaller parameter footprint and substantial energy savings (over $10\times$) compared with equivalent ANN baselines. Ablation studies corroborate the value of GLSC for preserving local/global information and of PLIF in achieving high accuracy with few parameters. Overall, the approach advances practical, energy-efficient KWS on neuromorphic hardware for edge devices.

Abstract

Thanks to Deep Neural Networks (DNNs), the accuracy of Keyword Spotting (KWS) has made substantial progress. However, as KWS systems are usually implemented on edge devices, energy efficiency becomes a critical requirement besides performance. Here, we take advantage of spiking neural networks' energy efficiency and propose an end-to-end lightweight KWS model. The model consists of two innovative modules: 1) Global-Local Spiking Convolution (GLSC) module and 2) Bottleneck-PLIF module. Compared to the hand-crafted feature extraction methods, the GLSC module achieves speech feature extraction that is sparser, more energy-efficient, and yields better performance. The Bottleneck-PLIF module further processes the signals from GLSC with the aim to achieve higher accuracy with fewer parameters. Extensive experiments are conducted on the Google Speech Commands Dataset (V1 and V2). The results show our method achieves competitive performance among SNN-based KWS models with fewer parameters.
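The dual-scale idea behind GLSC, a plain Conv1d for local context alongside a dilated Conv1d for wider context, followed by spiking, can be illustrated with a small pure-Python sketch. This is not the authors' implementation. The function names, the kernels, summing the two branches, and the single-step threshold used in place of full spiking dynamics are all assumptions made for illustration.

```python
def conv1d_same(x, kernel, dilation=1):
    """Zero-padded 1-D convolution (cross-correlation) with an odd-length kernel.

    The output has the same length as the input. With dilation d, the kernel
    taps are spaced d samples apart, which widens the receptive field.
    """
    half = (len(kernel) // 2) * dilation
    padded = [0.0] * half + list(x) + [0.0] * half
    return [
        sum(k * padded[t + i * dilation] for i, k in enumerate(kernel))
        for t in range(len(x))
    ]


def global_local_spikes(x, local_kernel, global_kernel, dilation, threshold=1.0):
    """Hypothetical global-local block: two branches, fused, then thresholded."""
    local = conv1d_same(x, local_kernel, dilation=1)          # local branch
    wide = conv1d_same(x, global_kernel, dilation=dilation)   # global branch
    # Assumption: the branches are fused by summation.
    current = [a + b for a, b in zip(local, wide)]
    # Assumption: a one-step Heaviside threshold stands in for spiking neurons.
    return [1 if c >= threshold else 0 for c in current]


impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]
kernel = [0.5, 0.5, 0.5]
wide_branch = conv1d_same(impulse, kernel, dilation=3)
spikes = global_local_spikes(impulse, kernel, kernel, dilation=3)
print(wide_branch)  # [0.0, 0.5, 0.0, 0.0, 0.5, 0.0, 0.0, 0.5, 0.0]
print(spikes)       # [0, 0, 0, 0, 1, 0, 0, 0, 0]
```

In this toy input, the dilated branch responds at positions 1, 4, and 7, while the local branch responds only at positions 3 to 5. Only the position where both branches agree crosses the threshold, so the output is sparse.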

Paper Structure (14 sections, 8 equations, 6 figures, 1 table)

This paper contains 14 sections, 8 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: A comparative analysis between single convolution and global-local convolution. (a) Dilated Conv1d with $stride=1$. The hidden layers and output features are highly redundant, as shown by the gray blocks representing the overlapping features. (b) Dilated Conv1d with $stride\neq1$. The receptive field increases exponentially with the dilation factor $d$, leading to a loss of local information, shown as white blocks. (c) and (d) The Global-Local convolution method. It achieves a good balance between global and local features in long speech sequences and maintains a consistent focus on local features when $stride\neq1$.
  • Figure 2: Our SNN-KWS model structure. It consists of $N_{Conv}=4$ GLSC blocks (right part) for better feature extraction, and $N_{Cla}=2$ Bottleneck-PLIF blocks (left part) for effective classification.
  • Figure 3: The Global-local convolution feature extraction in ANNs and GLSC layers. $U_{t+1}$ represents the membrane potential contribution of spiking neurons after decaying from $U_{t}$.
  • Figure 4: With the same inputs, neurons with different $\tau$ have different leak rates for their membrane potential (right part), leading to different outputs (left part).
  • Figure 5: The average spike firing rate of our SNN-KWS model when $TimeSteps$ is 8 on the GSC-V1 dataset. The average spike firing rate of the entire network is approximately 8.3%.
  • ...and 1 more figure
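The effect described in the Figure 4 caption, where the same input produces different outputs for different $\tau$, can be reproduced with a minimal leaky integrate-and-fire sketch. The paper's own neuron equations are not reproduced in this section, so the update rule below is an assumption: a common LIF form with resting potential 0 and a hard reset to 0 after each spike. In a PLIF neuron, $\tau$ would be a learnable parameter rather than a fixed argument.

```python
def lif_run(inputs, tau, threshold=1.0):
    """Simulate one LIF neuron over a sequence of input currents.

    Assumed update rule (not taken from the paper):
        U <- U + (x - U) / tau
    A larger tau makes the membrane potential move more slowly toward the input.
    Returns the spike train and the pre-reset membrane potentials.
    """
    u = 0.0
    spikes, potentials = [], []
    for x in inputs:
        u = u + (x - u) / tau          # leaky integration step
        fired = 1 if u >= threshold else 0
        potentials.append(u)
        spikes.append(fired)
        if fired:
            u = 0.0                    # hard reset after a spike (assumption)
    return spikes, potentials


constant_input = [1.5] * 6
fast_spikes, _ = lif_run(constant_input, tau=2.0)
slow_spikes, _ = lif_run(constant_input, tau=4.0)
print(fast_spikes)  # [0, 1, 0, 1, 0, 1]
print(slow_spikes)  # [0, 0, 0, 1, 0, 0]
```

With identical input, the neuron with $\tau=2$ fires every second step, while the neuron with $\tau=4$ needs four steps to reach the threshold.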