ED-sKWS: Early-Decision Spiking Neural Networks for Rapid and Energy-Efficient Keyword Spotting

Zeyang Song, Qianhui Liu, Qu Yang, Yizhou Peng, Haizhou Li

TL;DR

The paper targets rapid, energy-efficient keyword spotting (KWS) on edge devices using Spiking Neural Networks (SNNs) with an early-decision capability. It introduces a Cumulative Temporal (CT) loss that supervises the prediction at every timestep through the accumulated output $O[t]=\sum_{i=0}^t \mathrm{softmax}(U_R[i])$, with $L_{CT}=\frac{1}{T}\sum_{t=0}^T L_{CE}[O[t], y]$. A new SC-100 dataset, annotated with begin and end timestamps for 100 keywords, enables evaluation of early stopping and decision timing. Experiments on Google Speech Commands v2 and SC-100 show competitive accuracy while using about $61\%$ of the timesteps and about $52\%$ of the energy of SNN models without an early-decision mechanism, which supports the approach for real-time, energy-conscious KWS in edge settings.
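The two formulas above can be traced with a small pure-Python sketch. It rests on assumptions the summary does not confirm: `readout_potentials` is a hypothetical name for the per-timestep values $U_R[t]$, $L_{CE}$ is taken to apply a log-softmax to $O[t]$ (as cross-entropy does in common deep-learning frameworks), and the average is taken over the number of timesteps supplied, which matches the $\frac{1}{T}$ factor only up to the indexing convention.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, label):
    # Assumption: L_CE treats its first argument as unnormalized scores
    # and applies log-softmax before picking the target class.
    return -math.log(softmax(scores)[label])


def ct_loss(readout_potentials, label):
    """Average cross-entropy of the accumulated output O[t] over all timesteps.

    readout_potentials: one list of per-class values U_R[t] per timestep
    (hypothetical name, not taken from the paper).
    """
    num_classes = len(readout_potentials[0])
    accumulated = [0.0] * num_classes  # running O[t]
    total = 0.0
    for u_r in readout_potentials:
        probs = softmax(u_r)
        accumulated = [a + p for a, p in zip(accumulated, probs)]
        total += cross_entropy(accumulated, label)
    # Assumption: normalize by the number of timesteps actually summed.
    return total / len(readout_potentials)
```

Because every timestep contributes its own cross-entropy term, an intermediate $O[t]$ that already favors the correct class lowers the loss, which is how the summary describes CT loss improving predictions at intermediate as well as final timesteps.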

Abstract

Keyword Spotting (KWS) is essential in edge computing, where rapid and energy-efficient responses are required. Spiking Neural Networks (SNNs) are well suited to KWS because of their efficiency and their capacity to model the temporal structure of speech. To further reduce latency and energy consumption, this study introduces ED-sKWS, an SNN-based KWS model with an early-decision mechanism that can stop speech processing and output the result before the end of the speech utterance. Furthermore, we introduce a Cumulative Temporal (CT) loss that can enhance prediction accuracy at both the intermediate and final timesteps. To evaluate early-decision performance, we present the SC-100 dataset, which includes 100 speech commands with beginning and end timestamp annotations. Experiments on the Google Speech Commands v2 and our SC-100 datasets show that ED-sKWS maintains competitive accuracy while using 61% of the timesteps and 52% of the energy consumed by SNN models without an early-decision mechanism, ensuring rapid response and energy efficiency.

Paper Structure (13 sections, 2 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: The overall diagram of ED-sKWS with CT loss and the early-decision mechanism. The ED-sKWS system processes the fbank features of the input audio using a feed-forward SNN, which is indicated by the grey box. In the feed-forward SNN, frame $t$ is fed to timestep $t$ of the SNN, so the data is processed frame by frame. With the early-decision mechanism, ED-sKWS ends processing and delivers the output once the confidence score ($CS$) exceeds a predefined threshold ($C$).
  • Figure 2: Visualization of the raw audio signal and the spike firing rate of a sample from the SC-100 dataset. The ground-truth start ($t_{start}$) and end ($t_{end}$) points of the keyword are indicated by yellow dashed lines, while the early ($t_{d}$) and late ($T$) decision times are denoted by red and orange dashed lines, respectively.
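The early-decision rule in the Figure 1 caption (stop once $CS$ exceeds $C$) can be sketched as a frame-by-frame loop. The captions do not define how $CS$ is computed, so this sketch assumes it is the largest entry of the accumulated softmax output divided by the number of frames seen so far. The names `early_decision` and `readout_potentials` are hypothetical, and the fallback to the final timestep $T$ when the threshold is never crossed is likewise an assumption.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def early_decision(readout_potentials, threshold):
    """Return (predicted_class, decision_timestep) for one utterance.

    readout_potentials: one list of per-class readout values per frame
    (hypothetical input; frame t is consumed at timestep t). Must be non-empty.
    threshold: the predefined confidence threshold C.
    """
    num_classes = len(readout_potentials[0])
    accumulated = [0.0] * num_classes
    for t, u_r in enumerate(readout_potentials):
        probs = softmax(u_r)
        accumulated = [a + p for a, p in zip(accumulated, probs)]
        # Assumed confidence score: running mean of per-frame probabilities.
        mean_probs = [a / (t + 1) for a in accumulated]
        confidence = max(mean_probs)
        if confidence > threshold:
            # Early exit: remaining frames are never processed.
            return mean_probs.index(confidence), t
    # Assumed fallback: no early exit, decide at the last timestep.
    return accumulated.index(max(accumulated)), len(readout_potentials) - 1
```

Under this reading, a lower threshold gives earlier decisions (a smaller $t_{d}$ in Figure 2) at the risk of deciding before enough of the keyword has been heard, while a higher threshold pushes the decision toward $T$.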