DPSNN: Spiking Neural Network for Low-Latency Streaming Speech Enhancement

Tao Sun; Sander Bohté

TL;DR

This work tackles low-latency, energy-efficient speech enhancement (SE) for power-constrained devices by introducing DPSNN, a two-phase time-domain spiking neural network. Within a mask-based encoder-separator-decoder pipeline, the separator combines a Spiking Convolutional Neural Network (SCNN), which captures temporal context, with a Spiking Recurrent Neural Network (SRNN), which captures frequency context, to produce a denoising mask. Training uses surrogate-gradient methods and an activation-suppression regularizer that improves sparsity and energy efficiency. The model achieves roughly 5 ms latency with competitive SI-SNR, PESQ, DNSMOS, and STOI scores on the VCTK and Intel DNS datasets, and outperforms several baselines in latency-energy trade-offs. The approach shows strong potential for real-world neuromorphic SE applications in hearing aids and mobile devices, balancing speech quality, intelligibility, and power consumption.
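The mask-based encoder-separator-decoder data flow described above can be illustrated with a toy sketch. This is not the authors' implementation: the separator (SCNN and SRNN) is replaced by a placeholder `mask_fn`, the encoder and decoder are plain linear filters, and all function names and values are hypothetical.

```python
def frame_signal(x, frame_len, hop):
    """Split a waveform into frames of frame_len samples, one every hop samples."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def encode(frames, filters):
    """Project each frame onto every filter: one 1D feature per time step."""
    return [[sum(w * s for w, s in zip(f, frame)) for f in filters]
            for frame in frames]

def apply_mask(features, masks):
    """Multiply the 2D feature map element-wise by a mask of the same shape."""
    return [[m * v for m, v in zip(mask_row, feat_row)]
            for mask_row, feat_row in zip(masks, features)]

def decode(features, filters, hop, n_samples):
    """Map masked features back to a waveform (transposed convolution, overlap-add)."""
    out = [0.0] * n_samples
    for t, feat in enumerate(features):
        for k, f in enumerate(filters):
            for j, w in enumerate(f):
                out[t * hop + j] += feat[k] * w
    return out

def enhance(x, filters, hop, mask_fn):
    """Encoder -> separator (here just mask_fn) -> mask -> decoder."""
    frame_len = len(filters[0])
    features = encode(frame_signal(x, frame_len, hop), filters)
    masks = mask_fn(features)  # stand-in for the SCNN + SRNN separator
    return decode(apply_mask(features, masks), filters, hop, len(x))

# Toy setup (assumed values): identity filters, non-overlapping frames.
identity = [[1.0, 0.0], [0.0, 1.0]]
pass_all = lambda feats: [[1.0] * len(row) for row in feats]
block_all = lambda feats: [[0.0] * len(row) for row in feats]
noisy = [1.0, 2.0, 3.0, 4.0]

print(enhance(noisy, identity, 2, pass_all))   # [1.0, 2.0, 3.0, 4.0]
print(enhance(noisy, identity, 2, block_all))  # [0.0, 0.0, 0.0, 0.0]
```

An all-ones mask passes the signal through unchanged and an all-zeros mask silences it; a trained separator would produce values in between to attenuate noise-dominated features.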

Abstract

Speech enhancement (SE) improves communication in noisy environments, affecting areas such as automatic speech recognition, hearing aids, and telecommunications. Because these domains are typically power-constrained and event-based while requiring low latency, neuromorphic algorithms in the form of spiking neural networks (SNNs) have great potential. Yet current effective SNN solutions require a contextual sampling window that imposes substantial latency, typically around 32 ms, which is too long for many applications. Inspired by Dual-Path Recurrent Neural Networks (DPRNNs) in classical neural networks, we develop a two-phase time-domain streaming SNN framework: the Dual-Path Spiking Neural Network (DPSNN). In the DPSNN, the first phase uses Spiking Convolutional Neural Networks (SCNNs) to capture global contextual information, while the second phase uses Spiking Recurrent Neural Networks (SRNNs) to focus on frequency-related features. In addition, a regularizer suppresses activations to further enhance the energy efficiency of our DPSNNs. Evaluating on the VCTK and Intel DNS datasets, we demonstrate that our approach achieves the very low latency (approximately 5 ms) required for applications like hearing aids, while demonstrating excellent signal-to-noise ratio (SNR), perceptual quality, and energy efficiency.
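The latency figures quoted in the abstract can be related to framing parameters. The sketch below follows the decomposition given in the Figure 1 caption (buffering latency equals the frame shift; look-ahead latency is the remainder of the frame). The 16 kHz sampling rate and the 50% frame overlap are assumptions made for illustration, not values stated in this summary; the 80-sample frame is the encoder filter length from the Figure 4 caption.

```python
def algorithmic_latency_ms(frame_len, frame_shift, sample_rate):
    """Return (buffering, look_ahead, total) algorithmic latency in milliseconds.

    frame_len and frame_shift are in samples; sample_rate is in Hz.
    """
    if not 0 < frame_shift <= frame_len:
        raise ValueError("frame_shift must be in (0, frame_len]")
    to_ms = 1000.0 / sample_rate
    buffering = frame_shift * to_ms                 # wait for one frame shift of new samples
    look_ahead = (frame_len - frame_shift) * to_ms  # extra context inside the frame
    return buffering, look_ahead, buffering + look_ahead

# Assumed: 16 kHz audio, 50% overlap between frames.
print(algorithmic_latency_ms(80, 40, 16000))    # (2.5, 2.5, 5.0)
print(algorithmic_latency_ms(512, 256, 16000))  # (16.0, 16.0, 32.0)
```

Under these assumptions the total always equals the frame length, so an 80-sample frame corresponds to about 5 ms and a 512-sample frame to about 32 ms.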

Paper Structure (26 sections, 8 equations, 4 figures, 6 tables)

Introduction
Related work
Time-domain solutions
Speech enhancement with SNNs
Methods
Problem setup
Spiking neural networks (SNNs)
Architecture
SCNN layer
SRNN layer
Experiments
Datasets
VCTK Datasets
Intel DNS Dataset
Evaluation metrics
...and 11 more sections

Figures (4)

Figure 1: a, The algorithmic latency in block-based models consists of a buffering latency and a look-ahead latency. The buffering latency matches the frame-shift length (i.e. block size), whereas the look-ahead latency results from the extra look-ahead within a frame, typically used to provide additional processing context to improve performance. b, A frequency-domain DNN model transforms a noisy audio signal into its time-frequency (T-F) representation via the STFT and then feeds it into a neural network. c, Inputs and outputs to the time-domain DNN models are both time-domain signals. d, Mask-based time-domain DNN models commonly adopt an encoder-separator-decoder architecture.
Figure 2: The proposed DPSNN adopts the encoder-separator-decoder architecture. The encoder uses convolutions to convert waveform signals into encoded 2D feature maps, effectively replacing the function of STFT. In the separator, 2D masks are calculated, primarily relying on the SCNN and SRNN modules that capture the temporal and frequency contextual information of the encoded feature maps, respectively. After applying the masks to the feature maps from the encoder, the decoder transforms the masked feature maps back to enhanced waveform signals.
Figure 3: a, In the mask-based encoder-separator-decoder architecture, the encoder converts overlapping frames into 1D features through convolution and aligns them into a 2D feature map. Each 1D feature is processed in one time step in the subsequent spiking layers. b, In the SCNN layer, a group convolution is applied along the temporal axis of a feature map to capture temporal contextual information. c, The SRNN layer is a fully-connected recurrent spiking layer that integrates contexts along the frequency direction of its input 2D feature map. d, Readout is done using a fully-connected readout layer with non-spiking neurons, where the membrane potential of these neurons is calculated and output without any spiking or resetting.
Figure 4: Influence of the length of input examples on model performance. The channels in the model are $N=512, B=256, H=512$. The filter length in the encoder is $L=80$, resulting in a 5 ms latency. The size of the context step is 4.
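The contrast drawn in the Figure 3 caption between spiking layers and the non-spiking readout (membrane potential computed and output without spiking or resetting) can be sketched with a generic leaky integrate-and-fire (LIF) update. The decay factor, threshold, and subtractive reset below are illustrative assumptions; this summary does not specify the paper's neuron model, and feeding both neurons the same input is a simplification.

```python
def lif_step(v, x, decay=0.9, threshold=1.0):
    """One time step of a generic LIF neuron (assumed parameters).

    Returns the updated membrane potential and a binary spike.
    """
    v = decay * v + x                 # leaky integration of the input
    spike = 1 if v >= threshold else 0
    v -= spike * threshold            # subtractive reset after a spike (assumption)
    return v, spike

def readout_step(v, x, decay=0.9):
    """Non-spiking readout: same integration, but no threshold and no reset."""
    return decay * v + x

def run(inputs):
    """Drive one LIF neuron and one readout neuron with the same input sequence."""
    v_lif, v_out = 0.0, 0.0
    spikes, potentials = [], []
    for x in inputs:
        v_lif, s = lif_step(v_lif, x)
        v_out = readout_step(v_out, x)
        spikes.append(s)
        potentials.append(v_out)
    return spikes, potentials

spikes, potentials = run([0.6, 0.6, 0.6])
print(spikes)      # [0, 1, 0]
print(potentials)  # roughly [0.6, 1.14, 1.626]
```

The spiking neuron emits sparse binary events, while the readout exposes a continuous-valued potential that can serve directly as a real-valued output.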
