An Investigation of Time-Frequency Representation Discriminators for High-Fidelity Vocoder

Yicheng Gu, Xueyao Zhang, Liumeng Xue, Haizhou Li, Zhizheng Wu

TL;DR

This work tackles the limited discriminative power of STFT-based discriminators in GAN vocoders by introducing two discriminators built on time-frequency (TF) representations with dynamic resolution: the Multi-Scale Sub-Band Constant-Q Transform (MS-SB-CQT) discriminator and the Multi-Scale Temporal-Compressed Continuous Wavelet Transform (MS-TC-CWT) discriminator. By combining multi-scale processing, sub-band synchronization, and temporal compression, these discriminators better capture pitch and transient information. Joint training with existing STFT-based discriminators yields complementary improvements across speech and singing voice synthesis, and the approach generalizes across multiple state-of-the-art vocoders (e.g., HiFi-GAN, NSF-HiFiGAN, BigVGAN, APNet). The results show improvements in objective measures such as PESQ and F0RMSE and in subjective MOS. Because only the discriminators change, these gains come without altering generator inference. The learned representations also reflect dynamic attention across frequency bands. Overall, the study provides a framework for integrating diverse time-frequency analyses into GAN-based vocoders to achieve higher fidelity and more stable pitch in expressive audio.

Abstract

Generative Adversarial Network (GAN) based vocoders are superior in both inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator for GAN-based vocoders. Most existing Time-Frequency Representation (TFR)-based discriminators are rooted in the Short-Time Fourier Transform (STFT), which has a constant Time-Frequency (TF) resolution, linearly scaled center frequencies, and a fixed decomposition basis. This makes it ill-suited to signals such as singing voices, which require dynamic attention to different frequency bands and different time intervals. Motivated by this, we propose a Multi-Scale Sub-Band Constant-Q Transform (MS-SB-CQT) discriminator and a Multi-Scale Temporal-Compressed Continuous Wavelet Transform (MS-TC-CWT) discriminator. Both the CQT and the CWT have a dynamic TF resolution across frequency bands. Between the two, the CQT is better at modeling pitch information, whereas the CWT is better at modeling short-time transients. Experiments conducted on both speech and singing voices confirm the effectiveness of our proposed discriminators. Moreover, the STFT, CQT, and CWT-based discriminators can be used jointly for better performance. The proposed discriminators can boost the synthesis quality of various state-of-the-art GAN-based vocoders, including HiFi-GAN, BigVGAN, and APNet.
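
The contrast drawn in the abstract between the STFT's constant resolution with linearly scaled center frequencies and the CQT's frequency-dependent resolution can be illustrated numerically. The sketch below uses the textbook CQT definition (geometrically spaced center frequencies and a constant ratio Q of center frequency to bandwidth). It is not the configuration used in the paper: the function names and the values for the sample rate, FFT size, minimum frequency, and bins per octave are all illustrative assumptions.

```python
import math


def stft_bins(sample_rate, n_fft):
    """Center frequencies (Hz) and the constant bin spacing (Hz) of an STFT."""
    spacing = sample_rate / n_fft
    centers = [k * spacing for k in range(n_fft // 2 + 1)]
    return centers, spacing


def cqt_bins(f_min, bins_per_octave, n_bins, sample_rate):
    """Center frequencies, bandwidths, and window lengths of a textbook CQT."""
    # Q is the constant ratio between center frequency and bandwidth.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    # Center frequencies are geometrically spaced: f_k = f_min * 2^(k / B).
    centers = [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]
    # Bandwidth grows with frequency, so frequency resolution is finest at the bottom.
    bandwidths = [f / q for f in centers]
    # Window length (in samples) shrinks with frequency, so time resolution
    # is finest at the top.
    window_lengths = [q * sample_rate / f for f in centers]
    return centers, bandwidths, window_lengths, q


if __name__ == "__main__":
    # Assumed, illustrative settings (not taken from the paper).
    _, spacing = stft_bins(sample_rate=24000, n_fft=1024)
    centers, bandwidths, windows, q = cqt_bins(32.7, 12, 96, 24000)
    print(f"STFT: every bin is {spacing:.2f} Hz wide")
    for k in (0, 48, 95):
        print(f"CQT bin {k}: center {centers[k]:.1f} Hz, "
              f"bandwidth {bandwidths[k]:.2f} Hz, window {windows[k]:.0f} samples")
```

With these assumed settings, the STFT spacing is the same for every bin, while the CQT bandwidth widens and the analysis window shortens as the center frequency rises. This is the low-band frequency resolution and high-band time resolution that the abstract calls a dynamic TF resolution.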

Paper Structure (26 sections, 16 equations, 6 figures, 5 tables)

This paper contains 26 sections, 16 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Illustration of the TF resolution of the STFT, CQT, and CWT. A thinner chunk along the time/frequency axis means a better time/frequency resolution. The CQT and CWT spectrograms have a higher frequency resolution in low-frequency bands and a higher time resolution in high-frequency bands, while the STFT spectrogram has a fixed TF resolution across all frequency bands.
  • Figure 2: Visualization of the reconstructed square wave and the associated error function for different decomposition bases. The wavelet basis reconstructs the signal with a smaller error around the step transient.
  • Figure 3: Architecture of the Sub-Discriminator in the MS-SB-CQT Discriminator. Operator "C" denotes concatenation. SBP is our proposed Sub-Band Processor module. The desynchronized CQT spectrogram (bottom-right) is synchronized (upper-right) after SBP.
  • Figure 4: Architecture of the Sub-Discriminator in the MS-TC-CWT Discriminator. Operator "C" denotes concatenation. TC is our proposed Temporal Compressor module. Comp is a series of temporally overlapped convolution layers. $K$ is the total number of frequency bins. The CWT spectrogram (bottom-right) can be compressed while maintaining the overall energy distribution over different frequency bins (upper-right).
  • Figure 5: Comparison of mel spectrograms from HiFi-GANs enhanced by different discriminators. "S", "C", and "W" represent the MS-STFT, MS-SB-CQT, and MS-TC-CWT Discriminators, respectively. When trained with all three discriminators, HiFi-GAN achieves a higher synthesis quality, with more accurate harmonic tracking and fundamental frequency reconstruction and fewer glitches.
  • ...and 1 more figures
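
The observation in the Figure 2 caption, that a wavelet basis reconstructs a step transient with a smaller error than a sinusoidal basis, can be reproduced in an idealized setting. The sketch below is not the paper's experiment. It assumes a Haar wavelet, which the caption does not specify, and it places the step on a dyadic boundary, the most favourable case for Haar. The function names, the sample count, and the number of retained terms are also illustrative assumptions.

```python
import math


def square_wave(t):
    """One period on [0, 1): +1 on the first half, -1 on the second half."""
    return 1.0 if t < 0.5 else -1.0


def fourier_partial_sum(t, n_terms):
    """Truncated Fourier series of the square wave with n_terms odd harmonics."""
    total = 0.0
    for j in range(n_terms):
        k = 2 * j + 1
        total += math.sin(2.0 * math.pi * k * t) / k
    return 4.0 / math.pi * total


def haar_approximation(samples, level):
    """Project samples onto piecewise constants over 2**level equal blocks.

    This equals keeping the Haar scaling coefficient plus all Haar wavelet
    coefficients coarser than `level`. len(samples) must be divisible by
    2**level.
    """
    n_blocks = 2 ** level
    block = len(samples) // n_blocks
    out = []
    for b in range(n_blocks):
        chunk = samples[b * block:(b + 1) * block]
        mean = sum(chunk) / len(chunk)
        out.extend([mean] * len(chunk))
    return out


def compare(n_samples=256, n_terms=8, level=1):
    """Return (max Fourier error, max Haar error, peak of the Fourier sum)."""
    times = [(i + 0.5) / n_samples for i in range(n_samples)]
    target = [square_wave(t) for t in times]
    fourier = [fourier_partial_sum(t, n_terms) for t in times]
    haar = haar_approximation(target, level)
    fourier_err = max(abs(a - b) for a, b in zip(fourier, target))
    haar_err = max(abs(a - b) for a, b in zip(haar, target))
    return fourier_err, haar_err, max(fourier)


if __name__ == "__main__":
    fourier_err, haar_err, peak = compare()
    print(f"max error, 8 Fourier terms:    {fourier_err:.3f}")
    print(f"max error, 2 Haar coefficients: {haar_err:.3f}")
    print(f"peak of Fourier partial sum:    {peak:.3f}")
```

In this setting, two Haar coefficients reproduce the step exactly, whereas eight sinusoidal terms still overshoot the target level and leave a large error next to the discontinuity (the Gibbs phenomenon). A step that does not fall on a dyadic boundary would leave a nonzero Haar error, but it would stay confined to the block containing the step.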