Towards Ultra-Low-Power Neuromorphic Speech Enhancement with Spiking-FullSubNet

Xiang Hao; Chenxiang Ma; Qu Yang; Jibin Wu; Kay Chen Tan

TL;DR

This work proposes Spiking-FullSubNet, an ultra-low-power speech enhancement system based on a brain-inspired spiking neural network (SNN), which surpasses state-of-the-art methods by large margins in both speech quality and energy efficiency metrics.

Abstract

Speech enhancement is critical for improving speech intelligibility and quality in various audio devices. In recent years, deep learning-based methods have significantly improved speech enhancement performance, but they often come with a high computational cost, which is prohibitive for many edge devices, such as headsets and hearing aids. This work proposes Spiking-FullSubNet, an ultra-low-power speech enhancement system based on a brain-inspired spiking neural network (SNN). Spiking-FullSubNet follows a full-band and sub-band fusion approach to effectively capture both global and local spectral information. To improve the efficiency of the computationally expensive sub-band modeling, we introduce a frequency partitioning method inspired by the sensitivity profile of the human peripheral auditory system. Furthermore, we introduce a novel spiking neuron model that can dynamically control the integration and forgetting of input information, enhancing the multi-scale temporal processing capability of SNNs, which is critical for speech denoising. Experiments conducted on the recent Intel Neuromorphic Deep Noise Suppression (N-DNS) Challenge dataset show that Spiking-FullSubNet surpasses state-of-the-art methods by large margins in both speech quality and energy efficiency metrics. Notably, our system won the Intel N-DNS Challenge (Algorithmic Track), opening up a myriad of opportunities for ultra-low-power speech enhancement at the edge. Our source code and model checkpoints are publicly available at https://github.com/haoxiangsnr/spiking-fullsubnet.
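
The abstract describes a spiking neuron that dynamically controls how much input it integrates and how much past state it forgets. The following is a minimal sketch of that idea for a single neuron, not the authors' formulation: the sigmoid gate, the convex-combination update, the hard reset, and all names and parameters (`w_in`, `w_rec`, `bias`, `threshold`) are assumptions chosen for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_spiking_neuron(inputs, w_in=1.0, w_rec=1.0, bias=0.0, threshold=1.0):
    """Simulate one gated integrate-and-fire neuron over an input sequence.

    Hypothetical illustration: the gate form, the reset rule, and the
    scalar parameters are assumptions, not taken from the paper.
    Returns a list with one spike value (0.0 or 1.0) per time step.
    """
    membrane = 0.0    # membrane potential carried across time steps
    prev_spike = 0.0  # previous output spike, used as the recurrent input
    spikes = []
    for x in inputs:
        # The gate depends on the feedforward input and the recurrent input.
        gate = sigmoid(w_in * x + w_rec * prev_spike + bias)
        # A gate near 1 keeps the old state (slow forgetting);
        # a gate near 0 overwrites it with the new input (fast integration).
        membrane = gate * membrane + (1.0 - gate) * x
        spike = 1.0 if membrane >= threshold else 0.0
        if spike:
            membrane = 0.0  # hard reset after firing (assumed reset rule)
        spikes.append(spike)
        prev_spike = spike
    return spikes
```

Because the gate is recomputed at every time step, the effective decay rate of the membrane potential varies with the signal instead of being a fixed constant, which is the property the abstract attributes to the proposed neuron.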

Paper Structure (30 sections, 26 equations, 8 figures, 3 tables)

Introduction
Related Works
Spiking Neuron Model
Speech Enhancement
Sub-band Modeling in Speech Enhancement
Neuromorphic Speech Processing
Background
Spiking Neuron Model
Formulation of Speech Enhancement
Method
Gated Spiking Neuron
Spiking-FullSubNet Architecture
Full-Band Processing
Existing Sub-Band Processing Approach
Sub-Band Processing Based on Frequency Partitioning
...and 15 more sections

Figures (8)

Figure 1: Block diagram of the proposed real-time neuromorphic speech enhancement system.
Figure 2: Time-frequency magnitude spectrograms of different signals. Speech enhancement methods aim to recover clean speech from a noisy observation by removing the unwanted noise.
Figure 3: Diagram of the proposed Spiking-FullSubNet architecture. The architecture integrates a full-band model and sub-band models, with gated spiking neurons (GSNs) serving as the core of each model, to effectively enhance noisy speech signals. The full-band model operates on the noisy magnitude spectrogram to capture global spectral patterns, while the sub-band models focus on specific frequency bands to model local spectral information. Incorporating the newly proposed GSNs into both the full-band and sub-band models greatly improves the temporal processing capability. Finally, deep filtering is employed as the training target to obtain the enhanced spectrogram.
Figure 4: Illustration of the proposed GSN model, which regulates the membrane decay rate at each time step based on the feedforward and recurrent inputs.
Figure 5: Illustration of the sub-band processing in Spiking-FullSubNet. The input is a vector $\mathbf{x}_f(n)$ comprising the magnitude bin of frequency $f$, its $4$ neighboring frequency bins, and the corresponding bin of the spectral embedding output by the full-band model.
...and 3 more figures
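
The sub-band input described in the Figure 5 caption can be sketched for a single frame as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `build_subband_input` is hypothetical, the 4 neighboring bins are assumed to be 2 on each side of $f$, and zero padding at the spectrum edges is an assumed boundary rule.

```python
def build_subband_input(magnitudes, embedding, f, neighbors_per_side=2):
    """Build the sub-band input vector for frequency bin f of one frame.

    magnitudes: noisy magnitude spectrum of the frame, one value per bin.
    embedding:  full-band model output for the same frame, one value per bin.
    Assumptions (not from the paper): 2 neighbors on each side of f and
    zero padding where the window extends past the spectrum edges.
    """
    if len(magnitudes) != len(embedding):
        raise ValueError("magnitudes and embedding must have the same length")
    num_bins = len(magnitudes)
    if not 0 <= f < num_bins:
        raise IndexError("frequency index out of range")

    window = []
    for k in range(f - neighbors_per_side, f + neighbors_per_side + 1):
        # Use the magnitude if bin k exists, otherwise pad with zero.
        window.append(magnitudes[k] if 0 <= k < num_bins else 0.0)

    # Append the full-band embedding value of the center bin.
    return window + [embedding[f]]
```

With the default of 2 neighbors per side, each vector holds 6 values: the center bin, its 4 neighbors, and one embedding value.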
