Monaural Speech Enhancement with Complex Convolutional Block Attention Module and Joint Time Frequency Losses
Shengkui Zhao, Trung Hieu Nguyen, Bin Ma
TL;DR
The paper addresses monaural speech enhancement by introducing a Complex Convolutional Block Attention Module (CCBAM) that strengthens the representations of complex-valued networks, together with a mixed time-frequency loss that jointly optimizes TF-domain complex ratio mask (CRM) estimation and time-domain performance. The authors integrate CCBAM into the deep complex U-Net (DCUnet) and DCCRN architectures and train with a loss that combines SI-SNR with real and imaginary CRM errors, yielding an end-to-end speech enhancement framework. The main contributions are the design of complex channel- and spatial-attention gates, their integration into encoder-decoder structures with skip connections, and empirical validation on the WSJ0/DEMAND RNNoise and DNS datasets showing consistent gains in PESQ, STOI, SI-SNR, and FwSegSNR. The work advances practical monaural speech enhancement with end-to-end complex-valued networks and attention-driven feature recalibration, and may improve robustness in real-world noisy environments.
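As a rough illustration of a loss that combines SI-SNR with real and imaginary CRM errors, the following NumPy sketch adds a negative SI-SNR term on the waveform to a mean-squared error on the two CRM components. The SI-SNR formula is the standard definition; the use of MSE, the weight `alpha`, and the function names are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D waveforms."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the scaled target component.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    return 10.0 * np.log10((np.sum(s_target ** 2) + eps) / (np.sum(e_noise ** 2) + eps))

def mixed_loss(est_wave, ref_wave, est_crm, ref_crm, alpha=1.0):
    """Negative SI-SNR plus a weighted error on the real and imaginary CRM parts.

    Assumption: the MSE form and the weight alpha are illustrative choices.
    """
    time_term = -si_snr(est_wave, ref_wave)
    real_err = np.mean((est_crm.real - ref_crm.real) ** 2)
    imag_err = np.mean((est_crm.imag - ref_crm.imag) ** 2)
    return time_term + alpha * (real_err + imag_err)
```

In this sketch the time-domain term rewards waveform fidelity regardless of gain, while the TF-domain term penalizes mask errors on each complex component separately.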
Abstract
The deep complex U-Net and the convolutional recurrent network (CRN) achieve state-of-the-art performance for monaural speech enhancement. Both are encoder-decoder structures with skip connections, and both rely heavily on the representation power of their complex-valued convolutional layers. In this paper, we propose a complex convolutional block attention module (CCBAM) that boosts the representation power of complex-valued convolutional layers by constructing more informative features. The CCBAM is a lightweight, general module that can be easily integrated into any complex-valued convolutional layer. We integrate CCBAM into the deep complex U-Net and the CRN to improve their speech enhancement performance. We further propose a mixed loss function that jointly optimizes the complex models in both the time-frequency (TF) domain and the time domain. By combining CCBAM and the mixed loss, we form a new end-to-end (E2E) complex speech enhancement framework. Ablation experiments and objective evaluations show the superior performance of the proposed approaches (https://github.com/modelscope/ClearerVoice-Studio).
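To make the idea of attention on complex-valued features concrete, here is a minimal NumPy sketch of a channel-attention gate in the style of CBAM, adapted to complex inputs. It is one plausible reading, not the authors' implementation: pooling the real and imaginary parts separately, the complex ReLU, the two-layer shared MLP with hypothetical weights `w1` and `w2`, and gating each part with its own sigmoid are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def complex_relu(z):
    # ReLU applied to the real and imaginary parts independently.
    return np.maximum(z.real, 0.0) + 1j * np.maximum(z.imag, 0.0)

def complex_channel_attention(x, w1, w2):
    """Gate each channel of a complex feature map.

    x:  complex array of shape (C, F, T)
    w1: complex array of shape (C // r, C), hypothetical reduction weights
    w2: complex array of shape (C, C // r), hypothetical expansion weights
    """
    # Average- and max-pool real and imaginary parts over frequency and time.
    avg = x.real.mean(axis=(1, 2)) + 1j * x.imag.mean(axis=(1, 2))
    mx = x.real.max(axis=(1, 2)) + 1j * x.imag.max(axis=(1, 2))

    # Shared complex two-layer MLP applied to both pooled descriptors.
    def mlp(v):
        return w2 @ complex_relu(w1 @ v)

    logits = mlp(avg) + mlp(mx)
    # Assumption: separate sigmoid gates for the real and imaginary parts.
    gate_re = sigmoid(logits.real)[:, None, None]
    gate_im = sigmoid(logits.imag)[:, None, None]
    return gate_re * x.real + 1j * (gate_im * x.imag)
```

Because the gate only rescales existing channels and keeps the input shape, a module of this kind can follow a complex convolutional layer without changing the rest of the encoder-decoder.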
