Universal Discrete-Domain Speech Enhancement

Fei Liu, Yang Ai, Ye-Xin Lu, Rui-Chen Zheng, Hui-Peng Du, Zhen-Hua Ling

TL;DR

UDSE tackles universal speech enhancement by reframing SE as a discrete-domain token prediction problem guided by a residual vector quantizer (RVQ) of a pre-trained neural speech codec. It uses global conditioning from degraded speech to sequentially predict clean acoustic tokens and then decodes them to restore the waveform, without relying on textual cues or large language models. The approach achieves strong performance across conventional, unconventional, and mixed distortions, often surpassing regression-based and diffusion-based baselines in objective and subjective evaluations. This work highlights the practicality and robustness of discrete-domain SE for real-world, multi-distortion scenarios, and shows codec-generalization potential across RVQ-based systems.

Abstract

In real-world scenarios, speech signals are inevitably corrupted by various types of interference, making speech enhancement (SE) a critical task for robust speech processing. However, most existing SE methods handle only a limited range of distortions, such as additive noise, reverberation, or band limitation, while the study of SE under multiple simultaneous distortions remains limited. This gap affects the generalization and practical usability of SE methods in real-world environments. To address this gap, this paper proposes a novel Universal Discrete-domain SE model called UDSE. Unlike regression-based SE models that directly predict the clean speech waveform or continuous features, UDSE redefines SE as a discrete-domain classification task: it predicts the clean discrete tokens quantized by the residual vector quantizer (RVQ) of a pre-trained neural speech codec. Specifically, UDSE first extracts global features from the degraded speech. Guided by these global features, the clean token prediction for each VQ follows the rules of RVQ, where the prediction of each VQ relies on the results of the preceding ones. Finally, the predicted clean tokens from all VQs are decoded to reconstruct the clean speech waveform. During training, the UDSE model employs a teacher-forcing strategy and is optimized with cross-entropy loss. Experimental results confirm that the proposed UDSE model can effectively enhance speech degraded by various conventional and unconventional distortions, e.g., additive noise, reverberation, band limitation, clipping, phase distortion, and compression distortion, as well as their combinations. These results demonstrate the superior universality and practicality of UDSE compared to advanced regression-based SE methods.
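The coarse-to-fine dependency described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the two tiny codebooks, the `toy_predictor` stand-in for a trained network, and all function names are assumptions made for illustration. The sketch shows how an RVQ turns one frame into a sequence of tokens, how training with teacher forcing feeds the ground-truth tokens of earlier stages while scoring each stage with cross-entropy, and how inference feeds the model's own earlier predictions instead.

```python
import math

# Toy two-stage RVQ codebooks (hypothetical values, 2-D codewords).
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],   # stage 1: coarse
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25]],  # stage 2: refines the residual
]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rvq_encode(codebooks, vec):
    """Quantize stage by stage; each stage sees the residual left by earlier ones."""
    residual, tokens = list(vec), []
    for cb in codebooks:
        idx = min(range(len(cb)), key=lambda i: sq_dist(cb[i], residual))
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return tokens

def rvq_decode(codebooks, tokens):
    """Reconstruct the frame by summing the selected codeword of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for cb, idx in zip(codebooks, tokens):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def toy_predictor(global_feat, prev_tokens, stage):
    """Hypothetical stand-in for a trained network: scores each codeword of
    `stage` given a conditioning feature and the tokens of earlier stages."""
    residual = list(global_feat)
    for cb, idx in zip(CODEBOOKS, prev_tokens):
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return [-sq_dist(cw, residual) for cw in CODEBOOKS[stage]]

def teacher_forced_loss(predict_logits, global_feat, clean_tokens):
    """Training: stage k is conditioned on the ground-truth tokens of stages < k."""
    losses = [
        cross_entropy(predict_logits(global_feat, clean_tokens[:k], k), target)
        for k, target in enumerate(clean_tokens)
    ]
    return sum(losses) / len(losses)

def greedy_infer(predict_logits, global_feat, num_stages):
    """Inference: stage k is conditioned on the model's own earlier predictions."""
    tokens = []
    for k in range(num_stages):
        logits = predict_logits(global_feat, tokens, k)
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens

clean_frame = [1.2, 1.0]
clean_tokens = rvq_encode(CODEBOOKS, clean_frame)          # targets from the codec
loss = teacher_forced_loss(toy_predictor, clean_frame, clean_tokens)
predicted = greedy_infer(toy_predictor, clean_frame, len(CODEBOOKS))
restored = rvq_decode(CODEBOOKS, predicted)                # decoded enhanced frame
```

In this sketch the conditioning feature is simply the clean frame, so the greedy prediction matches the codec's own tokens. In the setting the abstract describes, the conditioning would instead be global features extracted from the degraded speech, and the predictor would be a learned model.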

Paper Structure

This paper contains 24 sections, 16 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of the proposed UDSE's inference process.
  • Figure 2: Overview of the proposed UDSE's training process.
  • Figure 3: Spectrogram comparison among degraded speech, clean speech, and speech enhanced by UDSE and by the baseline with the best objective scores, for the conventional DN, DR and BWE tasks, respectively.
  • Figure 4: Spectrogram comparison among degraded speech, clean speech, and speech enhanced by UDSE and by the baseline with the best objective scores, for the unconventional DC, PDR and CDR tasks, respectively.
  • Figure 5: Spectrogram comparison among degraded speech, clean speech, and speech enhanced by UDSE and by the baseline with the best objective scores, for the mixed DN+DR+BWE, DN+DR+DC and DN+PDR+CDR tasks, respectively.