JenGAN: Stacked Shifted Filters in GAN-Based Speech Synthesis

Hyunjae Cho, Junhyeok Lee, Wonbin Jung

TL;DR

JenGAN tackles audible aliasing in non-autoregressive GAN-based vocoders with a training-time anti-aliasing strategy that stacks shifted low-pass filters to enforce shift-equivariance. The method applies shifted sinc filters before and after each block, drawing random shifts $\delta$ during training while leaving inference unchanged ($\delta=0$), with adjustments for up/downsampling via $r_d$. With HiFi-GAN on LJSpeech, and with other vocoders, JenGAN yields consistent improvements in objective metrics (e.g., MAE, M-STFT, PESQ, MCD, V/UV F1, pitch) and in qualitative harmonic structure, especially when trained on the full data. The approach preserves fast inference and supports easy fine-tuning across architectures, suggesting a practical, broadly applicable anti-aliasing training paradigm for GAN-based speech synthesis.

Abstract

Non-autoregressive GAN-based neural vocoders are widely used due to their fast inference speed and high perceptual quality. However, they often suffer from audible artifacts such as tonal artifacts in their generated results. Therefore, we propose JenGAN, a new training strategy that involves stacking shifted low-pass filters to ensure the shift-equivariant property. This method helps prevent aliasing and reduce artifacts while preserving the model structure used during inference. In our experimental evaluation, JenGAN consistently enhances the performance of vocoder models, yielding significantly superior scores across the majority of evaluation metrics.

Paper Structure

This paper contains 17 sections, 3 equations, 2 figures, 3 tables, and 2 algorithms.

Figures (2)

  • Figure 1: Overview of the JenGAN method. The signal is shifted by $\delta$ at the input of the block and by $-\delta$ at the output of the block to achieve the shift-equivariant property.
  • Figure 2: Mel-spectrograms of (a) ground-truth speech and of speech generated by (b) the original HiFi-GAN model and (c) the model trained with JenGAN.
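
The shift-by-$\delta$, shift-by-$-\delta$ idea in Figure 1 can be illustrated with a small sketch. This is not the authors' implementation: the Hann-windowed sinc design, the cutoff, the filter length, the range of the random shift, and the names `shifted_sinc_filter`, `apply_filter`, and `wrapped_block` are all assumptions made for illustration, and the handling of up/downsampling via $r_d$ is left out entirely.

```python
import math
import random


def shifted_sinc_filter(delta, cutoff=0.45, half_width=32):
    """Hann-windowed sinc low-pass taps, displaced by `delta` samples.

    `cutoff` is in cycles per sample (0.5 is Nyquist). The design is an
    illustrative assumption, not the filter specified in the paper.
    """
    taps = []
    for n in range(-half_width, half_width + 1):
        t = n - delta  # evaluate the continuous kernel at a shifted position
        arg = 2.0 * cutoff * t
        sinc = 1.0 if arg == 0.0 else math.sin(math.pi * arg) / (math.pi * arg)
        if abs(t) <= half_width:
            window = 0.5 * (1.0 + math.cos(math.pi * t / half_width))
        else:
            window = 0.0  # outside the support of the shifted window
        taps.append(2.0 * cutoff * sinc * window)
    return taps


def apply_filter(signal, taps):
    """Centred FIR convolution with zero padding; output length equals input length."""
    half_width = len(taps) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            j = i - (k - half_width)  # tap k corresponds to offset n = k - half_width
            if 0 <= j < len(signal):
                acc += signal[j] * h
        out.append(acc)
    return out


def wrapped_block(signal, block, delta):
    """Shift by +delta before `block` and by -delta after it."""
    shifted = apply_filter(signal, shifted_sinc_filter(delta))
    return apply_filter(block(shifted), shifted_sinc_filter(-delta))


# Usage: a low-frequency sinusoid passed through a placeholder identity block.
x = [math.sin(2 * math.pi * 0.05 * n) for n in range(512)]
delta = random.uniform(-0.5, 0.5)  # hypothetical range for the training-time shift
y = wrapped_block(x, lambda s: s, delta)
```

With an identity block, the two opposite shifts cancel, so away from the zero-padded edges `y` stays close to `x` for a signal well inside the filter passband. With a real network block in place of the identity, the same wrapper would expose the block to randomly shifted inputs during training.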