Spiking Structured State Space Model for Monaural Speech Enhancement
Yu Du, Xu Liu, Yansong Chua
TL;DR
This work tackles the challenge of extracting clean speech from noisy monaural signals while reducing computational cost. It proposes Spiking-S4, a hybrid architecture that merges Spiking Neural Networks with Structured State Space Models to capture long-range dependencies efficiently. Experiments on the DNS Challenge 2023 and VoiceBank+Demand datasets show that Spiking-S4 achieves competitive or superior performance compared with state-of-the-art ANN baselines while using far fewer parameters and FLOPs. The results point to a promising direction for energy-efficient, real-time speech enhancement using neuromorphic-inspired components.
Abstract
Speech enhancement seeks to extract clean speech from noisy signals. Traditional deep learning methods face two challenges: making efficient use of the information in long speech sequences and keeping computational costs low. To address these, we introduce the Spiking Structured State Space Model (Spiking-S4), which combines the energy efficiency of Spiking Neural Networks (SNNs) with the long-range sequence modeling capabilities of Structured State Space Models (S4). Evaluation on the DNS Challenge and VoiceBank+Demand datasets confirms that Spiking-S4 rivals existing Artificial Neural Network (ANN) methods while using fewer computational resources, as evidenced by its lower parameter count and fewer Floating Point Operations (FLOPs).
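To make the two ingredients named in the abstract concrete, the following is a minimal toy sketch, not the authors' architecture. It assumes a diagonal, real-valued state matrix and a simple leaky integrate-and-fire (LIF) neuron with subtractive reset. The function names `ssm_scan` and `lif_spikes` and all parameter values are hypothetical and chosen only for illustration; how Spiking-S4 actually parameterizes and couples these components is not specified in this section.

```python
def ssm_scan(u, a, b, c):
    """Diagonal linear state-space recurrence (illustrative assumption):
    x_k = a * x_{k-1} + b * u_k,   y_k = sum(c * x_k)."""
    x = [0.0] * len(a)
    ys = []
    for u_k in u:
        # Update every state dimension independently (diagonal state matrix).
        x = [a_i * x_i + b_i * u_k for a_i, b_i, x_i in zip(a, b, x)]
        # Read out a scalar output from the state.
        ys.append(sum(c_i * x_i for c_i, x_i in zip(c, x)))
    return ys


def lif_spikes(currents, beta=0.9, threshold=1.0):
    """Leaky integrate-and-fire neuron (illustrative assumption):
    leak the membrane potential, add the input, emit a spike when the
    threshold is reached, then reset by subtracting the threshold."""
    v = 0.0
    spikes = []
    for i_k in currents:
        v = beta * v + i_k
        fired = v >= threshold
        spikes.append(1 if fired else 0)
        if fired:
            v -= threshold
    return spikes


# Feed a unit impulse through a one-dimensional state-space layer,
# then convert its output into binary spikes.
u = [1.0, 0.0, 0.0, 0.0]
y = ssm_scan(u, a=[0.5], b=[1.0], c=[1.0])
s = lif_spikes(y, beta=1.0, threshold=1.0)
```

In this toy setting the state-space layer carries a decaying memory of past inputs, while the spiking neuron turns its real-valued output into sparse binary events, which is the general source of the efficiency that spiking networks aim for.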
