Universal Score-based Speech Enhancement with High Content Preservation

Robin Scheibler, Yusuke Fujita, Yuma Shirahata, Tatsuya Komatsu

TL;DR

UNIVERSE++ addresses universal speech enhancement by combining score-based diffusion with adversarial training and content-preserving fine-tuning. It extends UNIVERSE with architectural normalization and anti-aliasing, a HiFi-GAN adversarial loss, and a LoRA-based phoneme fidelity objective to reduce hallucinations and preserve linguistic content. The diffusion process uses a time-dependent noise-variance schedule $\sigma_t^2$; the score network is trained with a score-matching objective and refined with perceptual and phoneme-based losses. Trained on a large-scale, variably degraded corpus and evaluated across diverse benchmarks, it achieves superior naturalness and competitive intelligibility and content metrics compared with discriminative and generative baselines. The work provides open-source implementations and shows robust performance under real-world distortions.

Abstract

We propose UNIVERSE++, a universal speech enhancement method based on score-based diffusion and adversarial training. Specifically, we improve the existing UNIVERSE model, which decouples clean speech feature extraction and diffusion. Our contributions are three-fold. First, we make several modifications to the network architecture, improving training stability and final performance. Second, we introduce an adversarial loss to promote learning high-quality speech features. Third, we propose a low-rank adaptation scheme with a phoneme fidelity loss to improve content preservation in the enhanced speech. In the experiments, we train a universal enhancement model on a large-scale dataset of speech degraded by noise, reverberation, and various distortions. The results on multiple public benchmark datasets demonstrate that UNIVERSE++ compares favorably to both discriminative and generative baselines for a wide range of qualitative and intelligibility metrics.
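The low-rank adaptation mentioned in the third contribution can be illustrated with a generic sketch. The abstract does not say which layers are adapted or how the phoneme fidelity loss is computed, so what follows is only the standard LoRA update $W' = W + \frac{\alpha}{r} B A$ in plain Python, not the authors' implementation. The dimensions, the scaling factor `alpha`, and the helper names `matmul`, `lora_weight`, and `apply_linear` are all assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank.

    w: frozen weight, shape (d_out, d_in)
    a: trainable factor, shape (r, d_in)
    b: trainable factor, shape (d_out, r)
    """
    r = len(a)
    delta = matmul(b, a)  # low-rank update, shape (d_out, d_in)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def apply_linear(weight, x):
    """Apply a bias-free linear layer to a single input vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in weight]


# Hypothetical layer sizes, chosen only for the example.
random.seed(0)
d_out, d_in, rank = 4, 6, 2

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]     # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]   # trainable
B = [[0.0] * rank for _ in range(d_out)]                                  # trainable, zero init

x = [1.0] * d_in
y_base = apply_linear(W, x)
y_lora = apply_linear(lora_weight(W, A, B, alpha=4.0), x)

# With B initialised to zero, the adapted layer starts out identical to the base layer.
print(y_base == y_lora)
```

Only `A` and `B` would be updated during fine-tuning, which here means `rank * (d_in + d_out)` parameters instead of `d_out * d_in`. In this toy layer the saving is small (20 against 24), but it grows quickly for larger layers at a fixed rank.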

Paper Structure (15 sections, 5 equations, 5 figures)

Figures (5)

  • Figure 1: Overview of the UNIVERSE network architecture with proposed (purple box) and original (yellow box) loss functions.
  • Figure 2: Proposed anti-aliasing filters at down/up-sampling stages of the score network UNet.
  • Figure 3: Results on the Voicebank-DEMAND dataset at 16 kHz. The values for HiFi++ and SGMSE+M are those reported in Andreev et al. (2023) and Lemercier et al. (2023), respectively.
  • Figure 4: Results on the Voicebank+DEMAND (VB) dataset, VB low-pass filtered at 4 kHz (VB-BWE), and the packet loss concealment challenge validation set (PLC). Arrows indicate whether a metric is better when increasing ($\uparrow$) or decreasing ($\downarrow$).
  • Figure 5: Results on the Signal Improvement Challenge non-blind test set, for which reference clean speech is not available.