Improving Speech Enhancement by Cross- and Sub-band Processing with State Space Model

Jizhen Li; Weiping Tu; Yuhong Yang; Xinmeng Xu; Yiqun Zhang; Yanzhen Ren

TL;DR

This work addresses the limitations of applying a single state space model (SSM) across sub-bands with very different characteristics in speech enhancement, in particular the loss of low-energy high-frequency spectral detail. It introduces Cross- and Sub-band Mamba (CSMamba), which combines a Band Split Block, a Spectrum Restoration Block, and a Channel Integrating Block within a bi-directional SSM framework to handle sub-band differences and preserve spectral structure. On the DNS Challenge 2021 dataset, CSMamba achieves state-of-the-art performance with only 1.73 million parameters, outperforming Transformer- and Mamba-based baselines on PESQ, STOI, and SI-SNRi. The results indicate that flexible sub-band processing and multi-perspective spectral restoration improve the robustness and efficiency of real-time speech enhancement.
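SI-SNRi, one of the three metrics listed above, is the gain in scale-invariant signal-to-noise ratio (SI-SNR) of the enhanced signal over the unprocessed noisy mixture, both measured against the clean reference. The sketch below follows the standard SI-SNR definition (zero-mean signals, projection of the estimate onto the reference); the function names and the toy signals are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np


def si_snr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB between a 1-D estimate and a 1-D reference."""
    # Remove the mean so a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference: this is the "target" component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    # Everything not explained by the reference counts as error.
    error = estimate - target
    return float(10 * np.log10((np.dot(target, target) + eps) / (np.dot(error, error) + eps)))


def si_snr_improvement(enhanced: np.ndarray, noisy: np.ndarray, clean: np.ndarray) -> float:
    """SI-SNRi: SI-SNR of the enhanced output minus SI-SNR of the noisy input."""
    return si_snr(enhanced, clean) - si_snr(noisy, clean)


# Toy example: a 440 Hz tone at 16 kHz plus white noise; the "enhanced"
# signal keeps only 10% of the noise amplitude (a stand-in for a model output).
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.normal(scale=0.5, size=clean.shape)
print(si_snr_improvement(clean + 0.1 * noise, clean + noise, clean))
```

Because the target is obtained by projection, rescaling the estimate leaves SI-SNR unchanged, which is why the metric does not reward or penalize a model for changing the output gain.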

Abstract

Recently, the state space model (SSM), represented by Mamba, has shown remarkable performance in long-sequence modeling tasks, including speech enhancement. However, because sub-band features differ substantially, applying the same SSM to all sub-bands limits its inference capability. Moreover, when processing each time frame of the time-frequency representation, the SSM may forget some low-energy high-frequency information, which makes it difficult to restore the structure of the high-frequency bands. To address these issues, we propose Cross- and Sub-band Mamba (CSMamba). To help the SSM handle different sub-band features flexibly, we propose a band split block that splits the full band into four sub-bands of different widths based on their information similarity. We then allocate independent weights to each sub-band, thereby reducing the inference burden on the SSM. Furthermore, to mitigate the SSM's forgetting of low-energy information in the high-frequency bands, we introduce a spectrum restoration block that enhances the representation of cross-band features from multiple perspectives. Experimental results on the DNS Challenge 2021 dataset demonstrate that CSMamba outperforms several state-of-the-art (SOTA) speech enhancement methods on three objective evaluation metrics while using fewer parameters.
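The band split idea above can be sketched as follows: a (time, frequency, channel) feature map is cut along frequency into four sub-bands of unequal width, and each sub-band gets its own parameters instead of sharing one set. This is a minimal sketch under stated assumptions, not the authors' implementation. The widths in BAND_WIDTHS are hypothetical (the paper chooses them by information similarity); "independent weights" is modeled as a per-band linear projection plus per-band SSM parameters; and bidirectional_diag_ssm is a simplified, fixed-parameter diagonal linear SSM run forward and backward over time, standing in for Mamba's input-dependent selective scan. The spectrum restoration and channel integrating blocks are not modeled.

```python
import numpy as np

# Hypothetical widths (in frequency bins) of the four sub-bands for a 257-bin
# spectrum, e.g. from a 512-point STFT. Illustrative only.
BAND_WIDTHS = (16, 48, 64, 129)


def band_split(feats: np.ndarray, widths=BAND_WIDTHS) -> list:
    """Split a (time, freq, channels) feature map along the frequency axis."""
    if sum(widths) != feats.shape[1]:
        raise ValueError("band widths must cover every frequency bin exactly once")
    edges = np.cumsum(widths)[:-1]  # split points between consecutive bands
    return np.split(feats, edges, axis=1)


def bidirectional_diag_ssm(x: np.ndarray, a: np.ndarray, b: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Simplified diagonal linear SSM over time, run in both directions.

    x: (time, dim). Recurrence h_t = a * h_{t-1} + b * x_t, output y_t = c * h_t.
    Unlike Mamba, a, b and c are fixed rather than computed from the input.
    """
    def scan(seq):
        h = np.zeros(seq.shape[1])
        out = np.empty_like(seq)
        for t in range(seq.shape[0]):
            h = a * h + b * seq[t]
            out[t] = c * h
        return out
    # Sum the forward pass and the time-reversed backward pass.
    return scan(x) + scan(x[::-1])[::-1]


class SubBandBlock:
    """Hypothetical block: each sub-band owns its projection and SSM parameters."""

    def __init__(self, widths, channels, hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.widths = widths
        self.proj = [rng.normal(0.0, 1.0 / np.sqrt(w * channels), (w * channels, hidden)) for w in widths]
        self.a = [rng.uniform(0.5, 0.99, hidden) for _ in widths]  # decay < 1 keeps scans stable
        self.b = [rng.normal(0.0, 1.0, hidden) for _ in widths]
        self.c = [rng.normal(0.0, 1.0, hidden) for _ in widths]

    def __call__(self, feats: np.ndarray) -> np.ndarray:
        outs = []
        for i, band in enumerate(band_split(feats, self.widths)):
            z = band.reshape(band.shape[0], -1) @ self.proj[i]  # (time, hidden)
            outs.append(bidirectional_diag_ssm(z, self.a[i], self.b[i], self.c[i]))
        return np.stack(outs, axis=1)  # (time, num_bands, hidden)


# Example: 100 frames, 257 bins, 2 channels (e.g. real and imaginary parts).
feats = np.random.default_rng(1).normal(size=(100, 257, 2))
out = SubBandBlock(BAND_WIDTHS, channels=2, hidden=32)(feats)
print(out.shape)  # (100, 4, 32)
```

Giving each band its own parameters means a narrow low-frequency band and a wide high-frequency band are not forced through one shared mapping, which is the intuition behind reducing the burden on a single SSM.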
