Towards High-Quality and Efficient Speech Bandwidth Extension with Parallel Amplitude and Phase Prediction

Ye-Xin Lu, Yang Ai, Hui-Peng Du, Zhen-Hua Ling

TL;DR

This work tackles speech bandwidth extension by directly predicting high-frequency amplitude and phase in parallel within a GAN framework. It introduces AP-BWE, a fully convolutional, end-to-end model with a dual-stream ConvNeXt generator and three kinds of discriminators: a multi-period discriminator (MPD) at the waveform level, and multi-resolution amplitude and phase discriminators (MRAD, MRPD) at the spectral level. The method optimizes amplitude and phase through specialized spectrum losses, anti-wrapping phase penalties, and a complex-spectrum consistency term, enabling precise phase extension alongside amplitude. Experiments on VCTK-0.92 demonstrate state-of-the-art quality at target sampling rates of 16 kHz and 48 kHz, with 48 kHz generation running 292.3 times faster than real time on a single RTX 4090 GPU and 18.1 times faster on a single CPU. Cross-dataset tests show strong generalization, which matters for real-time, high-fidelity wideband speech applications.
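The anti-wrapping phase penalty mentioned above can be illustrated with a minimal sketch. The specific form used here, f(x) = |x - 2π·round(x / 2π)|, is an assumption, since this summary does not spell out the formula, and the function names `anti_wrap` and `phase_l1` are hypothetical. The point is that phase is periodic, so a plain difference between a prediction near +π and a target near -π looks large even though the two angles are nearly identical.

```python
import math

def anti_wrap(x: float) -> float:
    """Fold a phase difference onto [0, pi] (assumed form of anti-wrapping)."""
    two_pi = 2.0 * math.pi
    return abs(x - two_pi * round(x / two_pi))

def phase_l1(pred, target):
    """Mean anti-wrapped absolute error between two phase sequences (hypothetical helper)."""
    return sum(anti_wrap(p - t) for p, t in zip(pred, target)) / len(pred)

# A prediction of 3.1 rad and a target of -3.1 rad are almost the same angle.
naive_error = abs(3.1 - (-3.1))      # 6.2, misleadingly large
wrapped_error = phase_l1([3.1], [-3.1])  # about 0.083
```

With this folding, a difference of exactly 2π contributes zero error, which is the behaviour a phase loss needs.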

Abstract

Speech bandwidth extension (BWE) widens the frequency bandwidth of speech signals, making the speech sound brighter and fuller. This paper proposes a generative adversarial network (GAN) based BWE model with parallel prediction of Amplitude and Phase spectra, named AP-BWE, which achieves both high-quality and efficient wideband speech waveform generation. The proposed AP-BWE generator is entirely based on convolutional neural networks (CNNs). It features a dual-stream architecture with mutual interaction, in which the amplitude stream and the phase stream communicate with each other and extend the high-frequency components from the input narrowband amplitude and phase spectra, respectively. To improve the naturalness of the extended speech signals, we employ a multi-period discriminator at the waveform level and design a pair of multi-resolution amplitude and phase discriminators at the spectral level. Experimental results demonstrate that AP-BWE achieves state-of-the-art speech quality for BWE tasks targeting sampling rates of both 16 kHz and 48 kHz. In terms of generation efficiency, owing to its all-convolutional architecture and all-frame-level operations, AP-BWE can generate 48 kHz waveform samples 292.3 times faster than real time on a single RTX 4090 GPU and 18.1 times faster than real time on a single CPU. Notably, to our knowledge, AP-BWE is the first to achieve direct extension of the high-frequency phase spectrum, which is beneficial for improving the effectiveness of existing BWE methods.
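To make the amplitude and phase representation concrete, the following sketch splits a single complex STFT bin into a log-amplitude and a wrapped phase, then recombines them. It is an illustration only, not the authors' code: the helper names, the `eps` floor, and the idea that the phase is obtained from a pair of values through a two-argument arc-tangent are assumptions made for this example.

```python
import cmath
import math

def decompose(z: complex, eps: float = 1e-5):
    """Return (log-amplitude, wrapped phase) of one complex spectrum bin."""
    return math.log(abs(z) + eps), cmath.phase(z)

def phase_from_pair(pseudo_real: float, pseudo_imag: float) -> float:
    """Turn two unconstrained values into a phase in (-pi, pi] (assumed usage of atan2)."""
    return math.atan2(pseudo_imag, pseudo_real)

def recombine(log_amp: float, phase: float) -> complex:
    """Rebuild the complex bin from log-amplitude and phase."""
    return cmath.rect(math.exp(log_amp), phase)

z = complex(-1.0, 1.0)
log_amp, phase = decompose(z)
z_rebuilt = recombine(log_amp, phase)  # close to z, up to the eps floor
```

Because `atan2` depends only on the ratio and signs of its arguments, a network that outputs such a pair does not need to constrain its scale, and the resulting phase is always a valid principal value.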

Paper Structure (33 sections, 3 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: The overall structure of the proposed AP-BWE. $\mathrm{Abs}(\cdot)$ and $\mathrm{Angle}(\cdot)$ denote the amplitude and phase calculation functions, while $\log(\cdot)$ and $\exp(\cdot)$ denote the logarithmic and exponential functions, respectively. $\mathrm{Arctan2}$ refers to the two-argument arc-tangent function.
  • Figure 2: Details of the ConvNeXt block (Liu et al., 2022). Each ConvNeXt block consists of a $7 \times 1$ depth-wise convolution, followed by layer normalization, a $1 \times 1$ point-wise convolution for dimensionality projection with an expansion factor of 3, a GELU activation layer, and another $1 \times 1$ point-wise convolution for dimensionality restoration, followed by a residual connection.
  • Figure 3: Details of the discriminators. The parameters inside the parentheses for each convolutional layer respectively represent the number of channels, kernel size, and stride.
  • Figure 4: Spectrogram visualization of the original wideband 16 kHz speech waveform and speech waveforms extended by baseline methods and our proposed AP-BWE from the source sampling rate of 2 kHz.
  • Figure 6: Spectrogram visualization of the original wideband speech waveform and speech waveforms extended by the ablation models of our proposed AP-BWE with a source sampling rate of 8 kHz and target sampling rate of 48 kHz. "AP-BWE w/o MRDs" represents the ablation of both MRAD and MRPD, while "AP-BWE w/o Disc." denotes the ablation of all discriminators.
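
The ConvNeXt block described in the Figure 2 caption can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: biases, learnable normalization parameters, and any layer scaling are omitted, zero padding is assumed for the depth-wise convolution, and all function names and weights are hypothetical.

```python
import math
import random

def depthwise_conv7(x, kernels):
    """x: [channels][frames]; one 7-tap kernel per channel; zero 'same' padding."""
    out = []
    for ch, k in zip(x, kernels):
        padded = [0.0] * 3 + list(ch) + [0.0] * 3
        out.append([sum(k[j] * padded[t + j] for j in range(7))
                    for t in range(len(ch))])
    return out

def layer_norm(x, eps=1e-6):
    """Normalize across channels, independently for each frame (no affine terms)."""
    n_ch, n_fr = len(x), len(x[0])
    out = [[0.0] * n_fr for _ in range(n_ch)]
    for t in range(n_fr):
        col = [x[c][t] for c in range(n_ch)]
        mean = sum(col) / n_ch
        var = sum((v - mean) ** 2 for v in col) / n_ch
        for c in range(n_ch):
            out[c][t] = (col[c] - mean) / math.sqrt(var + eps)
    return out

def pointwise(x, w):
    """1x1 convolution; w has shape [out_channels][in_channels]."""
    n_fr = len(x[0])
    return [[sum(w_row[c] * x[c][t] for c in range(len(x))) for t in range(n_fr)]
            for w_row in w]

def gelu(v):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def convnext_block(x, dw, w_up, w_down):
    h = depthwise_conv7(x, dw)                  # 7x1 depth-wise convolution
    h = layer_norm(h)                           # layer normalization
    h = pointwise(h, w_up)                      # project C -> 3C
    h = [[gelu(v) for v in row] for row in h]   # GELU activation
    h = pointwise(h, w_down)                    # restore 3C -> C
    return [[a + b for a, b in zip(xr, hr)] for xr, hr in zip(x, h)]  # residual

# Usage with random (hypothetical) weights: 4 channels, 10 frames.
rng = random.Random(0)
C, T = 4, 10
x = [[rng.uniform(-1, 1) for _ in range(T)] for _ in range(C)]
dw = [[rng.uniform(-1, 1) for _ in range(7)] for _ in range(C)]
w_up = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(3 * C)]
w_down = [[rng.uniform(-1, 1) for _ in range(3 * C)] for _ in range(C)]
y = convnext_block(x, dw, w_up, w_down)
```

The block keeps the number of channels and frames unchanged, which is what allows such blocks to be stacked and lets every operation run at the frame level.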