DPSNN: Spiking Neural Network for Low-Latency Streaming Speech Enhancement
Tao Sun, Sander Bohté
TL;DR
This work tackles low-latency, energy-efficient speech enhancement for power-constrained devices by introducing DPSNN, a two-phase time-domain spiking neural network. The separator combines a Spiking Convolutional Neural Network (SCNN) to capture temporal context and a Spiking Recurrent Neural Network (SRNN) to capture frequency context, producing a denoising mask within a mask-based encoder-separator-decoder pipeline. Training uses surrogate-gradient methods and an activation-suppression regularizer to improve sparsity and energy efficiency. The model achieves roughly 5 ms latency with competitive SI-SNR, PESQ, DNSMOS, and STOI scores on the VCTK and Intel DNS datasets, and outperforms several baselines in latency-energy trade-offs. These results indicate that the approach is a promising candidate for neuromorphic SE in hearing aids and mobile devices, where speech quality, intelligibility, and power consumption must be balanced.
Abstract
Speech enhancement (SE) improves communication in noisy environments, affecting areas such as automatic speech recognition, hearing aids, and telecommunications. Because these domains are typically power-constrained and event-based while requiring low latency, neuromorphic algorithms in the form of spiking neural networks (SNNs) have great potential. Yet current effective SNN solutions require a contextual sampling window that imposes substantial latency, typically around 32 ms, which is too long for many applications. Inspired by Dual-Path Recurrent Neural Networks (DPRNNs) in classical neural networks, we develop a two-phase time-domain streaming SNN framework, the Dual-Path Spiking Neural Network (DPSNN). In the DPSNN, the first phase uses Spiking Convolutional Neural Networks (SCNNs) to capture global contextual information, while the second phase uses Spiking Recurrent Neural Networks (SRNNs) to focus on frequency-related features. In addition, a regularizer suppresses activation to further enhance the energy efficiency of our DPSNNs. Evaluating on the VCTK and Intel DNS datasets, we demonstrate that our approach achieves the very low latency (approximately 5 ms) required for applications like hearing aids, while demonstrating excellent signal-to-noise ratio (SNR), perceptual quality, and energy efficiency.
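To make the training ingredients named above concrete, the following is a minimal sketch of a single spiking neuron with a surrogate gradient and an activity penalty. It is illustrative only and is not the authors' implementation: the leaky integrate-and-fire dynamics, the fast-sigmoid surrogate, the mean-spike-rate form of the regularizer, and all names and constants (`lif_forward`, `surrogate_grad`, `activity_penalty`, `beta`, `threshold`, `slope`, `weight`) are assumptions chosen for the example.

```python
def lif_forward(inputs, beta=0.9, threshold=1.0):
    """Run one leaky integrate-and-fire neuron over a sequence of input currents.

    Returns the binary spike train and the membrane potential at each step
    (recorded before the reset). All parameter values are hypothetical.
    """
    v = 0.0
    spikes, potentials = [], []
    for x in inputs:
        v = beta * v + x                      # leaky integration of the input
        s = 1.0 if v >= threshold else 0.0    # Heaviside spike function
        potentials.append(v)
        spikes.append(s)
        v -= s * threshold                    # soft reset after a spike
    return spikes, potentials


def surrogate_grad(v, threshold=1.0, slope=5.0):
    """Fast-sigmoid surrogate for the derivative of the spike function.

    The true derivative of the Heaviside step is zero almost everywhere, so
    backpropagation substitutes a smooth function peaked at the threshold.
    """
    return 1.0 / (1.0 + slope * abs(v - threshold)) ** 2


def activity_penalty(spikes, weight=0.1):
    """Assumed regularizer: penalize the mean firing rate to encourage sparsity."""
    return weight * sum(spikes) / len(spikes)


if __name__ == "__main__":
    spikes, potentials = lif_forward([0.6, 0.6, 0.0, 0.6, 0.6])
    print("spikes:", spikes)
    print("penalty:", activity_penalty(spikes))
    print("surrogate gradients:", [round(surrogate_grad(v), 4) for v in potentials])
```

In a full model, a penalty of this kind would be added to the enhancement loss so that fewer spikes, and hence fewer synaptic operations, are traded off against speech quality.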
