HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial Network for High-Fidelity Speech Super-Resolution

Shengkui Zhao, Kun Zhou, Zexu Pan, Yukun Ma, Chong Zhang, Bin Ma

TL;DR

The paper addresses high-fidelity speech super-resolution from low-sampling-rate inputs and the inconsistent representations that arise in two-stage pipelines. It introduces HiFi-SR, a unified end-to-end GAN that couples a transformer-based encoder (MossFormer2) with a HiFi-GAN-style convolutional waveform generator to produce 48 kHz output from 4–32 kHz input. Training uses a multi-band time-frequency discriminator, a multi-scale mel-spectrogram loss, and a feature matching term. The generator objective is $\mathcal{L}_G=\mathcal{L}_{Adv}(G)+\lambda_m\mathcal{L}_{m}(G)+\lambda_f\mathcal{L}_f(G)$ with $\lambda_m=7$ and $\lambda_f=1.5$, and the discriminator objective is $\mathcal{L}_D=\mathcal{L}_{Adv}(D)$. Trained on VCTK and additionally evaluated on EXPRESSO and VocalSet, HiFi-SR achieves a lower LSD (0.82) than prior methods and is preferred in ABX tests, especially on out-of-domain data, which indicates improved generalization. The work demonstrates that a unified, end-to-end adversarial framework can outperform modular SR approaches by jointly optimizing latent prediction and waveform reconstruction, with practical value for applications that need high-quality 48 kHz speech.
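
As a minimal sketch of how the generator objective above combines its three terms, the following Python function applies the stated weights $\lambda_m=7$ and $\lambda_f=1.5$. The function name, argument names, and the example values are hypothetical; the individual loss terms are assumed to be already computed as scalars, and nothing here reflects the authors' actual training code.

```python
def generator_loss(adv_loss, mel_loss, fm_loss, lambda_m=7.0, lambda_f=1.5):
    """Weighted sum L_G = L_Adv(G) + lambda_m * L_m(G) + lambda_f * L_f(G).

    adv_loss, mel_loss, fm_loss are assumed to be precomputed scalar values
    of the adversarial, multi-scale mel, and feature matching terms.
    """
    return adv_loss + lambda_m * mel_loss + lambda_f * fm_loss


# Illustrative (made-up) scalar values for the three terms.
example_total = generator_loss(adv_loss=1.0, mel_loss=2.0, fm_loss=3.0)
```

With these weights the mel-reconstruction term dominates the sum unless its raw value is much smaller than the other two, which is consistent with its role as the main reconstruction signal alongside the adversarial term.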

Abstract

The application of generative adversarial networks (GANs) has recently advanced speech super-resolution (SR) based on intermediate representations like mel-spectrograms. However, existing SR methods that typically rely on independently trained and concatenated networks may lead to inconsistent representations and poor speech quality, especially in out-of-domain scenarios. In this work, we propose HiFi-SR, a unified network that leverages end-to-end adversarial training to achieve high-fidelity speech super-resolution. Our model features a unified transformer-convolutional generator designed to seamlessly handle both the prediction of latent representations and their conversion into time-domain waveforms. The transformer network serves as a powerful encoder, converting low-resolution mel-spectrograms into latent space representations, while the convolutional network upscales these representations into high-resolution waveforms. To enhance high-frequency fidelity, we incorporate a multi-band, multi-scale time-frequency discriminator, along with a multi-scale mel-reconstruction loss in the adversarial training process. HiFi-SR is versatile, capable of upscaling any input speech signal between 4 kHz and 32 kHz to a 48 kHz sampling rate. Experimental results demonstrate that HiFi-SR significantly outperforms existing speech SR methods across both objective metrics and ABX preference tests, for both in-domain and out-of-domain scenarios (https://github.com/modelscope/ClearerVoice-Studio).
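
The objective metric quoted in the TL;DR is log-spectral distance (LSD). A minimal NumPy sketch of one common LSD definition follows: the root-mean-square difference of log power spectra over frequency, averaged over frames. The STFT settings (`n_fft`, `hop`), the `eps` floor, and the function names are assumptions chosen for illustration; the paper's exact evaluation configuration is not given in this summary.

```python
import numpy as np


def stft_power(x, n_fft=2048, hop=512):
    """Power spectrogram |STFT|^2 with a Hann window, shape (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def log_spectral_distance(reference, estimate, eps=1e-8):
    """Mean over frames of the RMS (over frequency) log10-power difference."""
    ref = np.log10(stft_power(reference) + eps)
    est = np.log10(stft_power(estimate) + eps)
    per_frame = np.sqrt(np.mean((ref - est) ** 2, axis=1))
    return float(np.mean(per_frame))


# Toy example: a 48 kHz reference with a high-band component,
# and an "estimate" that is missing that component.
sr = 48000
t = np.arange(sr) / sr
reference = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 12000 * t)
estimate = np.sin(2 * np.pi * 440 * t)

lsd_same = log_spectral_distance(reference, reference)
lsd_missing_band = log_spectral_distance(reference, estimate)
```

Lower values indicate a closer match to the reference spectrum, so an estimate that fails to restore high-frequency content is penalized relative to a perfect reconstruction.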

Paper Structure (11 sections, 5 equations, 5 figures, 1 table)

This paper contains 11 sections, 5 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Overview of our proposed generative transformer-convolutional adversarial network for speech super-resolution (HiFi-SR). The transformer-convolutional generator includes a hybrid MossFormer and recurrent network followed by a reused HiFi-GAN generator. Three discriminators (MSD, MPD, and MBD) are combined with the feature matching loss $\mathcal{L}_f$ and the mel-spectrogram loss $\mathcal{L}_m$ for high-fidelity adversarial training.
  • Figure 2: Spectrogram illustrations of different system outputs for a sample input from the VocalSet singing test set. It demonstrates that HiFi-SR significantly outperforms the baseline NVSR model.
  • Figure 3: Comparison results of NVSR and HiFi-SR on EXPRESSO test set with 48 kHz target sampling rate and four input sampling rates.
  • Figure 4: Comparison results of NVSR and HiFi-SR on VocalSet test set with 48 kHz target sampling rate and four input sampling rates.
  • Figure 5: ABX subjective test results of NVSR and HiFi-SR on mixed EXPRESSO and VocalSet test set with 48 kHz target sampling rate and four input sampling rates.