SpikCommander: A High-performance Spiking Transformer with Multi-view Learning for Efficient Speech Command Recognition

Jiaqi Wang, Liutao Yu, Xiongri Shen, Sihang Guo, Chenlin Zhou, Leilei Zhao, Yi Zhong, Zhiguo Zhang, Zhengyu Ma

TL;DR

Spiking neural networks promise energy-efficient speech command recognition, but existing approaches struggle to model long-range temporal dependencies. The authors introduce MSTASA, a multi-view spiking temporal-aware self-attention module, and SpikCommander, a fully spike-driven transformer that combines MSTASA with SCR-MLP and a spiking embedding extractor to jointly enhance temporal context and inter-channel integration. Across SHD, SSC, and GSC, SpikCommander achieves state-of-the-art accuracy with fewer parameters and lower estimated energy/SOPs than prior SNN methods, while maintaining strong performance with varying time steps. This work demonstrates a scalable, energy-efficient SCR backbone suitable for neuromorphic hardware, with robust long-term temporal modeling and competitive real-time performance in resource-constrained environments.

Abstract

Spiking neural networks (SNNs) offer a promising path toward energy-efficient speech command recognition (SCR) by leveraging their event-driven processing paradigm. However, existing SNN-based SCR methods often struggle to capture rich temporal dependencies and contextual information from speech due to limited temporal modeling and binary spike-based representations. To address these challenges, we first introduce the multi-view spiking temporal-aware self-attention (MSTASA) module, which combines effective spiking temporal-aware attention with a multi-view learning framework to model complementary temporal dependencies in speech commands. Building on MSTASA, we further propose SpikCommander, a fully spike-driven transformer architecture that integrates MSTASA with a spiking contextual refinement channel MLP (SCR-MLP) to jointly enhance temporal context modeling and channel-wise feature integration. We evaluate our method on three benchmark datasets: the Spiking Heidelberg Dataset (SHD), the Spiking Speech Commands (SSC), and the Google Speech Commands V2 (GSC). Extensive experiments demonstrate that SpikCommander consistently outperforms state-of-the-art (SOTA) SNN approaches with fewer parameters under comparable time steps, highlighting its effectiveness and efficiency for robust speech command recognition.
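
To make the "event-driven" and "binary spike-based" wording above concrete, the following is a minimal sketch of a single leaky integrate-and-fire (LIF) neuron. The abstract does not state which neuron model SpikCommander uses; LIF is assumed here only because it is a common choice in SNN work, and the function name, parameters, and defaults are all hypothetical.

```python
def lif_spikes(currents, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Simulate one leaky integrate-and-fire neuron over discrete time steps.

    Illustrative only: the neuron model and every parameter here are
    assumptions, not details taken from the paper.
    """
    v = v_reset
    spikes = []
    for x in currents:
        # Leaky integration: the membrane potential drifts toward the input.
        v = v + (x - (v - v_reset)) / tau
        if v >= v_threshold:
            spikes.append(1)  # emit a binary spike
            v = v_reset       # hard reset after firing
        else:
            spikes.append(0)  # stay silent; no downstream work is triggered
    return spikes


if __name__ == "__main__":
    print(lif_spikes([0.0, 0.0, 3.0, 0.0, 3.0, 3.0]))
```

The output is a sequence of 0s and 1s, so downstream layers only need to accumulate on time steps where a spike occurs. This is the usual basis for the energy-efficiency argument, and it also shows why a binary code can lose information relative to real-valued activations.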

Paper Structure

This paper contains 28 sections, 20 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Illustration of three spiking transformer block variants with different modeling strategies: (a) Global spiking self-attention with channel MLP; (b) Hybrid spiking self-attention and convolution with channel MLP; (c) Multi-view temporal-aware self-attention with spiking contextual refinement MLP (our SpikCommander).
  • Figure 2: Two key modules of the SpikCommander architecture. (a) Spiking embedding extractor (SEE), which encodes speech inputs into spiking embeddings for subsequent attention processing; (b) Spiking contextual refinement MLP (SCR-MLP), which integrates spike-aware channel mixing and selective contextual refinement to enhance both spatial and temporal feature learning.
  • Figure 3: Illustration of the multi-view spiking temporal-aware self-attention (MSTASA). (a) Architecture combining local sliding-window STASA, long-range STASA, and a complementary convolutional V-branch; (b) Internal mechanism of STASA.
  • Figure 4: Long-term learning performance of 2-block SpikCommander on SSC and GSC under varying time steps.
  • Figure 5: Comparison of the “Four” command in the GSC dataset before and after augmentation under 100 time steps.
  • ...and 4 more figures
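
The Figure 3 caption describes MSTASA as combining a local sliding-window view with a long-range view. The sketch below illustrates only that general idea. The internals of STASA are not given in this excerpt, so a softmax-free dot product over binary spike matrices stands in for it, the two views are simply summed, and the convolutional V-branch is omitted. The names `spike_attention`, `multi_view`, `window`, and `scale` are hypothetical.

```python
def spike_attention(q, k, v, window=None, scale=1.0):
    """Softmax-free dot-product attention over binary spike matrices.

    q, k, v: lists of T rows, each a list of D values in {0, 1}.
    window: if given, step i attends only to steps j with |i - j| <= window.
    This is an assumed stand-in, not the paper's STASA definition.
    """
    T, D = len(q), len(q[0])
    out = []
    for i in range(T):
        row = [0.0] * D
        for j in range(T):
            if window is not None and abs(i - j) > window:
                continue  # outside the local sliding window
            # For binary spikes the dot product counts coincident spikes.
            score = sum(a * b for a, b in zip(q[i], k[j]))
            for d in range(D):
                row[d] += scale * score * v[j][d]
        out.append(row)
    return out


def multi_view(q, k, v, window=1, scale=1.0):
    """Sum a local (windowed) view and a long-range (unrestricted) view."""
    local = spike_attention(q, k, v, window=window, scale=scale)
    long_range = spike_attention(q, k, v, window=None, scale=scale)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(local, long_range)]


if __name__ == "__main__":
    q = k = [[1, 0], [0, 1], [1, 1]]
    v = [[1, 0], [0, 1], [1, 0]]
    print(multi_view(q, k, v, window=0))
```

With `window=0` the local view reduces to each time step attending to itself, while the long-range view still mixes all steps. The two views therefore contribute different temporal context to the same output.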