KS-Net: Multi-band joint speech restoration and enhancement network for 2024 ICASSP SSI Challenge
Guochen Yu, Runqiang Han, Chenglin Xu, Haoran Zhao, Nan Li, Chen Zhang, Xiguang Zheng, Chao Zhou, Qi Huang, Bing Yu
TL;DR
KS-Net targets robust speech restoration and enhancement under complex acoustic distortions by combining a complex-domain restoration GAN with a multi-band fusion speech enhancer. The first stage performs coarse correction on the complex spectrum using a multi-subband encoder–decoder with S-TCM and TF-LSTM modules, trained with multi-resolution STFT discriminators and a composite loss. The second stage, MF-Net, refines spectral details across ERB-scaled bands using band-specific sub-networks. On the blind test set of the 2024 ICASSP SSI Challenge, the system ranks first in both tracks: an overall P.804 MOS of 3.49 and a WAcc of 0.78 in the real-time track, and an overall P.804 MOS of 3.43 and a WAcc of 0.78 in the non-real-time track. These results suggest the system handles diverse distortions such as noise, reverberation, packet loss, and loudness variations while remaining suitable for real-time use.
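To make the idea of ERB-scaled bands concrete, the following minimal sketch splits the bins of an STFT frame into bands of equal width on the ERB-rate scale, using the standard Glasberg and Moore formula. It is an illustration only: the function names, the 48 kHz sample rate, the FFT size, and the number of bands are assumptions, and this section does not specify how MF-Net actually partitions the spectrum.

```python
import numpy as np


def hz_to_erb_rate(f_hz):
    """Map frequency in Hz to the Glasberg-Moore ERB-rate scale."""
    return 21.4 * np.log10(1.0 + 0.00437 * np.asarray(f_hz, dtype=float))


def erb_rate_to_hz(erb):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (np.asarray(erb, dtype=float) / 21.4) - 1.0) / 0.00437


def erb_band_slices(n_fft, num_bands, sample_rate):
    """Return (start_bin, stop_bin) pairs covering all n_fft // 2 + 1 bins.

    Band edges are spaced uniformly on the ERB-rate scale, so low-frequency
    bands are narrow and high-frequency bands are wide.
    """
    n_bins = n_fft // 2 + 1
    if num_bands > n_bins:
        raise ValueError("more bands than frequency bins")
    nyquist = sample_rate / 2.0
    # Uniform spacing in ERB-rate, converted back to Hz.
    edges_hz = erb_rate_to_hz(
        np.linspace(0.0, hz_to_erb_rate(nyquist), num_bands + 1)
    )
    # Convert Hz edges to bin indices.
    edges = np.round(edges_hz / nyquist * (n_bins - 1)).astype(int)
    edges[0] = 0
    edges[-1] = n_bins
    # Guarantee that every band contains at least one bin.
    for i in range(1, len(edges)):
        edges[i] = max(edges[i], edges[i - 1] + 1)
    return [(int(edges[i]), int(edges[i + 1])) for i in range(num_bands)]


# Hypothetical settings: 48 kHz audio, 960-point FFT, 4 bands.
bands = erb_band_slices(n_fft=960, num_bands=4, sample_rate=48000)
widths = [stop - start for start, stop in bands]
```

In a band-specific design, each `(start_bin, stop_bin)` slice would index the frequency axis of the spectrogram passed to the corresponding sub-network, so that finer resolution is spent on the low frequencies where the ERB scale is densest.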
Abstract
This paper presents the speech restoration and enhancement system created by the 1024K team for the ICASSP 2024 Speech Signal Improvement (SSI) Challenge. Our system consists of a generative adversarial network (GAN) in the complex domain for speech restoration and a fine-grained multi-band fusion module for speech enhancement. On the blind test set of SSI, the proposed system achieves an overall mean opinion score (MOS) of 3.49 based on ITU-T P.804 and a Word Accuracy Rate (WAcc) of 0.78 in the real-time track, as well as an overall P.804 MOS of 3.43 and a WAcc of 0.78 in the non-real-time track, ranking 1st in both tracks.
