SICRN: Advancing Speech Enhancement through State Space Model and Inplace Convolution Techniques

Changjiang Zhao, Shulin He, Xueliang Zhang

TL;DR

SICRN targets robust single-channel speech enhancement by addressing frequency-domain downsampling distortions and limited temporal modeling in conventional CRNs. It fuses a multidimensional state-space model (S4ND) with inplace convolution to capture global frequency dependencies and local structure without downsampling, followed by a lightweight temporal encoder (2-layer LSTM) and complex-mask reconstruction. Evaluated on the INTERSPEECH 2020 DNS dataset, SICRN achieves competitive performance with only 2.16M parameters and 4.24G MACs per second, outperforming many baselines and operating causally without future frames. The results suggest that combining S4ND and inplace convolution yields high-quality, efficient speech enhancement suitable for real-time deployment and challenging acoustic conditions.

Abstract

Speech enhancement aims to improve speech quality and intelligibility, especially in noisy environments where background noise degrades speech signals. Deep learning methods have achieved great success in speech enhancement, notably the representative convolutional recurrent neural network (CRN) and its variants. However, CRN typically employs consecutive downsampling and upsampling convolutions for frequency modeling, which destroys the inherent structure of the signal along the frequency axis. Additionally, convolutional layers lack temporal modeling ability. To address these issues, we propose an innovative module combining a State space model and Inplace Convolution (SIC) to replace the conventional convolution in CRN; the resulting network is called SICRN. Specifically, a dual-path multidimensional state space model captures global frequency dependencies and long-term temporal dependencies. Meanwhile, 2D inplace convolution captures the local structure without any downsampling or upsampling. Systematic evaluations on the public INTERSPEECH 2020 DNS challenge dataset demonstrate SICRN's efficacy. Compared to strong baselines, SICRN achieves performance close to the state of the art while having advantages in model parameters, computation, and algorithmic delay. The proposed SICRN shows great promise for improved speech enhancement.
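
To make the downsampling point concrete, the following minimal sketch traces how many frequency bins survive a stack of convolutions along the frequency axis. It is illustrative arithmetic only, not the authors' implementation: the bin count (257), the number of layers (4), and the kernel, stride, and padding values are all assumptions chosen for the example, and `conv_out_len` and `trace_freq_bins` are hypothetical helper names. Only the standard convolution output-size formula is used.

```python
def conv_out_len(n_in, kernel, stride, padding):
    """Output length of a convolution along one axis (standard formula)."""
    return (n_in + 2 * padding - kernel) // stride + 1


def trace_freq_bins(n_bins, layers):
    """Return the frequency-axis size before and after each layer.

    `layers` is a list of (kernel, stride, padding) tuples.
    """
    sizes = [n_bins]
    for kernel, stride, padding in layers:
        sizes.append(conv_out_len(sizes[-1], kernel, stride, padding))
    return sizes


# Assumption: 257 bins, as produced by a 512-point STFT.
N_BINS = 257

# Assumption: a CRN-style encoder that halves the frequency axis per layer.
strided_layers = [(3, 2, 1)] * 4

# Assumption: inplace convolution modeled as stride 1 with "same" padding.
inplace_layers = [(3, 1, 1)] * 4

strided_sizes = trace_freq_bins(N_BINS, strided_layers)
inplace_sizes = trace_freq_bins(N_BINS, inplace_layers)

print("strided:", strided_sizes)
print("inplace:", inplace_sizes)
```

Under these assumed settings the strided stack compresses the frequency axis at every layer, while the stride-1 stack keeps one output per input bin, which is the property the abstract attributes to inplace convolution.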

Paper Structure (14 sections, 15 equations, 1 figure, 4 tables)