Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks

Mahmoud Salhab, Haidar Harmanani

TL;DR

The paper addresses speech bandwidth expansion by training an end-to-end high-fidelity GAN that maps narrowband speech to wideband speech across multiple upsampling ratios $\boldsymbol{s}>1$. It introduces a unified generator-discriminator architecture inspired by HiFi-GAN: the generator uses transposed convolutions and multi-receptive-field blocks, the discriminators are a multi-scale discriminator (MSD) and a multi-period discriminator (MPD), and training combines adversarial, mel-spectrogram reconstruction, and feature-matching losses. Key contributions are a single model that handles several upsampling ratios, zero-shot generalization to unseen ratios, and empirical improvements over end-to-end baselines while remaining competitive with cascaded NVSR methods, particularly at higher ratios. Evaluation uses the VCTK dataset with Log Spectral Distance (LSD) as the metric. The results suggest practical benefits for real-world speech enhancement: a simpler, scalable approach with robust performance across bandwidth expansion factors.
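As a point of reference, the three loss terms named above can be written in the form used by the original HiFi-GAN work. This is a sketch under that assumption, not the paper's own equations: here $x$ is the narrowband input, $y$ the wideband target, $D_k$ ranges over the MSD and MPD sub-discriminators, $D_k^{i}$ is the $i$-th layer of $D_k$ with $N_i$ features, $\phi$ is the mel-spectrogram transform, and the weights $\lambda_{fm}$ and $\lambda_{mel}$ are left unspecified because this summary does not state them.

$$
\mathcal{L}_{adv}(G; D_k) = \mathbb{E}_{x}\left[\left(D_k(G(x)) - 1\right)^2\right], \qquad
\mathcal{L}_{adv}(D_k; G) = \mathbb{E}_{(x,y)}\left[\left(D_k(y) - 1\right)^2 + D_k(G(x))^2\right]
$$

$$
\mathcal{L}_{fm}(G; D_k) = \mathbb{E}_{(x,y)}\left[\sum_{i} \frac{1}{N_i}\left\lVert D_k^{i}(y) - D_k^{i}(G(x)) \right\rVert_1\right], \qquad
\mathcal{L}_{mel}(G) = \mathbb{E}_{(x,y)}\left[\left\lVert \phi(y) - \phi(G(x)) \right\rVert_1\right]
$$

$$
\mathcal{L}_G = \sum_{k}\left[\mathcal{L}_{adv}(G; D_k) + \lambda_{fm}\,\mathcal{L}_{fm}(G; D_k)\right] + \lambda_{mel}\,\mathcal{L}_{mel}(G)
$$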

Abstract

Speech bandwidth expansion is crucial for expanding the frequency range of low-bandwidth speech signals, thereby improving audio quality, clarity and perceptibility in digital applications. Its applications span telephony, compression, text-to-speech synthesis, and speech recognition. This paper presents a novel approach using a high-fidelity generative adversarial network. Unlike cascaded systems, our system is trained end-to-end on paired narrowband and wideband speech signals. Our method integrates various bandwidth upsampling ratios into a single unified model specifically designed for speech bandwidth expansion applications. Our approach exhibits robust performance across various bandwidth expansion factors, including those not encountered during training, demonstrating zero-shot capability. To the best of our knowledge, this is the first work to showcase this capability. The experimental results demonstrate that our method outperforms previous end-to-end approaches, as well as interpolation and traditional techniques, showcasing its effectiveness in practical speech enhancement applications.
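To make the idea of paired narrowband and wideband signals concrete, the sketch below derives a narrowband signal from a wideband one for an integer upsampling ratio $s$. The abstract does not say how the pairs were built, so the function name `make_narrowband` and the choice of an ideal FFT-truncation low-pass filter are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np

def make_narrowband(wideband, s):
    """Band-limit and decimate a 1-D signal by an integer ratio s.

    Assumption: an ideal low-pass filter implemented by truncating the
    spectrum; a real pipeline might use a different anti-aliasing filter.
    """
    n = len(wideband) - len(wideband) % s   # trim so the length divides by s
    x = np.asarray(wideband[:n], dtype=float)
    m = n // s                              # number of narrowband samples
    spectrum = np.fft.rfft(x)
    kept = spectrum[: m // 2 + 1]           # keep bins up to the new Nyquist
    # The inverse transform is shorter by a factor s, so rescale amplitudes.
    return np.fft.irfft(kept, n=m) / s

# Usage: build one (narrowband, wideband) training pair for s = 4.
rng = np.random.default_rng(0)
wideband = rng.standard_normal(16000)       # stand-in for a speech clip
narrowband = make_narrowband(wideband, s=4)
pair = (narrowband, wideband)
```

A model trained on such pairs has to recover $s$ output samples for every input sample, which is why a single model covering several ratios needs some way to handle different input-to-output length relationships.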

Paper Structure (13 sections, 9 equations, 3 figures, 1 table)

This paper contains 13 sections, 9 equations, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Complete architecture of the model.
  • Figure 2: Spectrogram analysis of narrowband-to-wideband speech reconstruction with varying upsampling ratios ($\mathbf{s}=8$, $\mathbf{s}=4$, $\mathbf{s}=2$).
  • Figure 3: Performance comparison of our unified model across various upsampling ratios, demonstrating its ability to handle unseen upsampling ratios while maintaining a low Log Spectral Distance (LSD) compared to traditional interpolation methods.
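
Since Figure 3 reports Log Spectral Distance, here is a minimal NumPy sketch of the LSD definition commonly used in audio super-resolution work: the per-frame root-mean-square difference of log power spectra, averaged over frames. Whether the paper uses exactly this form, and the STFT settings `n_fft=2048` and `hop=512`, are assumptions; the helper names are hypothetical.

```python
import numpy as np

def stft_power(x, n_fft=2048, hop=512, eps=1e-10):
    """Power spectrogram |STFT|^2 with a Hann window, shape (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # eps keeps the logarithm finite for silent bins.
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2 + eps

def log_spectral_distance(reference, estimate, n_fft=2048, hop=512):
    """Mean over frames of the RMS log-power difference over frequency bins."""
    n = min(len(reference), len(estimate))  # compare equal-length signals
    ref = np.log10(stft_power(reference[:n], n_fft, hop))
    est = np.log10(stft_power(estimate[:n], n_fft, hop))
    per_frame = np.sqrt(np.mean((ref - est) ** 2, axis=1))
    return float(np.mean(per_frame))
```

Lower values mean the reconstructed spectrum is closer to the wideband reference, and identical signals give a distance of zero.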