Spiking-LEAF: A Learnable Auditory front-end for Spiking Neural Networks
Zeyang Song, Jibin Wu, Malu Zhang, Mike Zheng Shou, Haizhou Li
TL;DR
This work addresses the limited speech-processing performance of brain-inspired SNNs, which it attributes to suboptimal auditory front-ends. It proposes Spiking-LEAF, a fully learnable front-end that jointly optimizes a Gabor-filter-based acoustic feature extractor and an IHC-inspired two-compartment spiking neuron (IHC-LIF) enriched with lateral feedback. Training adds a spike-rate regularization loss, $L = L_{cls} + \lambda L_{SR}$ with $L_{SR} = \mathrm{ReLU}(R - SR)$, to promote sparse, efficient spike encoding. On keyword spotting and speaker identification benchmarks, Spiking-LEAF achieves state-of-the-art accuracy, noise robustness, and encoding efficiency, outperforming both conventional filter-bank (fbank) features and prior spike-based front-ends. These results indicate strong potential for ultra-low-power neuromorphic speech processing at the edge.
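The spike-rate regularized objective can be sketched in Python as follows. This is a minimal, non-authoritative sketch that only computes loss values with NumPy. It assumes that $R$ is the mean spike rate (spikes per neuron per time step) of the front-end output and that $SR$ is a target rate chosen as a hyperparameter, so the penalty is active only when the front-end fires more often than the target. Cross-entropy is assumed for $L_{cls}$, and the helper names (`cross_entropy`, `spike_rate_loss`, `total_loss`) and default values of `target_rate` and `lam` are hypothetical.

```python
import numpy as np


def cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean softmax cross-entropy, standing in for L_cls (an assumption)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def spike_rate_loss(spikes: np.ndarray, target_rate: float) -> float:
    """L_SR = ReLU(R - SR).

    Assumption: R is the mean spike rate per neuron per time step of the
    0/1 spike tensor `spikes` (shape: batch, time, channels), and SR is
    `target_rate`. Only firing above the target is penalized.
    """
    rate = float(spikes.mean())
    return max(rate - target_rate, 0.0)


def total_loss(logits: np.ndarray, labels: np.ndarray, spikes: np.ndarray,
               target_rate: float = 0.1, lam: float = 1.0) -> float:
    """L = L_cls + lambda * L_SR (hypothetical defaults)."""
    return cross_entropy(logits, labels) + lam * spike_rate_loss(spikes, target_rate)


# Toy usage: one of four channels fires at every step, so R = 0.25.
spikes = np.zeros((2, 10, 4))
spikes[:, :, 0] = 1.0
logits = np.zeros((2, 3))  # uniform predictions over 3 classes
labels = np.array([0, 1])
print(spike_rate_loss(spikes, target_rate=0.1))  # approximately 0.15
print(total_loss(logits, labels, spikes, target_rate=0.1, lam=2.0))  # log(3) + 0.3, about 1.399
```

In actual training this objective would be written in an autodiff framework, with a surrogate gradient standing in for the non-differentiable spike function; NumPy is used here only to make the arithmetic explicit.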
Abstract
Brain-inspired spiking neural networks (SNNs) have demonstrated great potential for temporal signal processing. However, their performance in speech processing remains limited due to the lack of an effective auditory front-end. To address this limitation, we introduce Spiking-LEAF, a learnable auditory front-end designed specifically for SNN-based speech processing. Spiking-LEAF combines a learnable filter bank with a novel two-compartment spiking neuron model called IHC-LIF. Inspired by the structure of inner hair cells (IHCs), IHC-LIF neurons leverage segregated dendritic and somatic compartments to capture the multi-scale temporal dynamics of speech signals. In addition, IHC-LIF neurons incorporate a lateral feedback mechanism, which, together with a spike-rate regularization loss, improves spike encoding efficiency. On keyword spotting and speaker identification tasks, Spiking-LEAF outperforms both state-of-the-art (SOTA) spiking auditory front-ends and conventional real-valued acoustic features in classification accuracy, noise robustness, and encoding efficiency.
