RaD-Net 2: A causal two-stage repairing and denoising speech enhancement network with knowledge distillation and complex axial self-attention

Mingshuai Liu, Zhuangqi Chen, Xiaopeng Yan, Yuanjun Lv, Xianjun Xia, Chuanzeng Huang, Yijian Xiao, Lei Xie

TL;DR

RaD-Net 2 tackles real-time speech enhancement under diverse distortions by extending the two-stage RaD-Net in two ways: a causality-based knowledge distillation scheme that lets a causal model benefit from future information, and complex axial self-attention (ASA) in the denoising stage to capture long-range spectral relations. In the repairing stage, a non-causal teacher guides a causal student; in the denoising stage, complex ASA is added to better model complex spectra. Training combines frequency-domain and adversarial losses. Empirical results show a 0.10 OVRL DNSMOS improvement over RaD-Net on the ICASSP 2024 SSI blind test set, and RaD-Net 2 also outperforms Gesper, the state-of-the-art system from the 2023 SSI Challenge, on the corresponding blind test set. These gains are of practical significance for deploying higher-quality, real-time speech enhancement in communication systems and related applications.

Abstract

In real-time speech communication systems, speech signals are often degraded by multiple distortions. Recently, a two-stage Repair-and-Denoising network (RaD-Net) was proposed and achieved superior speech quality improvement in the ICASSP 2024 Speech Signal Improvement (SSI) Challenge. However, its failure to use future information and the constrained receptive field of its convolution layers limit the system's performance. To mitigate these problems, we extend RaD-Net to its upgraded version, RaD-Net 2. Specifically, causality-based knowledge distillation is introduced in the first stage to use future information in a causal way: the non-causal repairing network serves as the teacher to improve the performance of the causal repairing network. In addition, in the second stage, complex axial self-attention is applied in the denoising network's complex feature encoder/decoder. Experimental results on the ICASSP 2024 SSI Challenge blind test set show that RaD-Net 2 brings a 0.10 OVRL DNSMOS improvement over RaD-Net.
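
The teacher/student idea in the abstract can be illustrated with a toy sketch. It is not the paper's implementation. The filter, the loss form, and the names `temporal_filter`, `distillation_loss`, and `alpha` are assumptions made for illustration. The sketch shows what "non-causal" versus "causal" means along the time axis, and one common way a distillation term is added to a task loss.

```python
def temporal_filter(frames, kernel, causal):
    """Apply a sliding weighted sum along time to a list of frame values.

    causal=True  pads only on the left, so output[t] depends on frames[<=t].
    causal=False pads on both sides, so output[t] also depends on future frames.
    (Illustrative stand-in for a convolution layer; not the paper's network.)
    """
    k = len(kernel)
    if causal:
        left, right = k - 1, 0
    else:
        left = (k - 1) // 2
        right = k - 1 - left
    padded = [0.0] * left + list(frames) + [0.0] * right
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(frames))
    ]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def distillation_loss(student_out, teacher_out, target, alpha=0.5):
    """Hypothetical objective: task loss plus a term pulling the causal
    student toward the non-causal teacher. alpha is an assumed weight."""
    return mse(student_out, target) + alpha * mse(student_out, teacher_out)


if __name__ == "__main__":
    frames = [1.0, 2.0, 3.0, 4.0, 5.0]
    kernel = [1.0, 1.0, 1.0]
    student = temporal_filter(frames, kernel, causal=True)
    teacher = temporal_filter(frames, kernel, causal=False)
    print("student (past only):  ", student)
    print("teacher (sees future):", teacher)
    print("loss:", distillation_loss(student, teacher, target=frames))
```

In this sketch only the student would run at inference time, so the system stays causal. The teacher's access to future frames affects the student only through the extra loss term during training.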

Paper Structure (11 sections, 8 equations, 2 figures, 2 tables)

This paper contains 11 sections, 8 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: The architecture of RaD-Net 2.
  • Figure 2: The architecture of the Complex ASA module.