Wavehax: Aliasing-Free Neural Waveform Synthesis Based on 2D Convolution and Harmonic Prior for Reliable Complex Spectrogram Estimation

Reo Yoneyama, Atsushi Miyashita, Ryuichi Yamamoto, Tomoki Toda

TL;DR

This work identifies aliasing as a fundamental bottleneck in neural vocoders, especially under high-F0 extrapolation. It proposes Wavehax, a GAN-based vocoder that performs aliasing-free waveform synthesis by estimating complex spectrograms from a harmonic-prior-informed 2D time–frequency representation, using 2D CNNs and ConvNeXt blocks with iSTFT reconstruction. The harmonic prior provides a strong inductive bias for robust harmonic content, enabling high-quality synthesis with much greater efficiency than state-of-the-art time-domain vocoders and superior robustness to unseen F0 values. Experimental results on the JVS dataset show Wavehax achieving competitive perceptual quality while drastically reducing parameters and MACs, and exhibiting notable resilience to high-F0 scenarios where aliasing typically degrades performance.

Abstract

Neural vocoders often struggle with aliasing in latent feature spaces, caused by time-domain nonlinear operations and resampling layers. Aliasing folds high-frequency components into the low-frequency range, making aliased and original frequency components indistinguishable and introducing two practical issues. First, aliasing complicates the waveform generation process, as the subsequent layers must address these aliasing effects, increasing the computational complexity. Second, it limits extrapolation performance, particularly in handling high fundamental frequencies, which degrades the perceptual quality of generated speech waveforms. This paper demonstrates that 1) time-domain nonlinear operations inevitably introduce aliasing but provide a strong inductive bias for harmonic generation, and 2) time-frequency-domain processing can achieve aliasing-free waveform synthesis but lacks the inductive bias for effective harmonic generation. Building on this insight, we propose Wavehax, an aliasing-free neural WAVEform generator that integrates 2D convolution and a HArmonic prior for reliable Complex Spectrogram estimation. Experimental results show that Wavehax achieves speech quality comparable to existing high-fidelity neural vocoders and exhibits exceptional robustness in scenarios requiring high fundamental frequency extrapolation, where aliasing effects typically become severe. Moreover, Wavehax requires less than 5% of the multiply-accumulate operations and model parameters of HiFi-GAN V1, while running inference more than four times faster on a CPU.
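
The folding effect described in the abstract can be illustrated with a small numerical sketch. It is not taken from the paper's code: the helper name `folded_frequency` is hypothetical, and the 60 Hz sinusoid sampled at 1 kHz is borrowed from the Figure 2 caption. A half-wave rectified sine contains even harmonics of its fundamental, so applying ReLU sample by sample creates a 600 Hz component that cannot be represented below the 500 Hz Nyquist frequency and instead shows up at 400 Hz.

```python
import numpy as np

def folded_frequency(f_hz: float, fs_hz: float) -> float:
    """Frequency at which a component at f_hz appears after sampling at fs_hz."""
    r = f_hz % fs_hz
    return fs_hz - r if r > fs_hz / 2 else r

fs = 1000                                # sampling frequency in Hz (Figure 2 caption)
f0 = 60                                  # sinusoid frequency in Hz (Figure 2 caption)
t = np.arange(fs) / fs                   # one second, so rfft bin k corresponds to k Hz
x = np.sin(2 * np.pi * f0 * t)
y = np.maximum(x, 0.0)                   # ReLU applied directly in the time domain

spectrum = np.abs(np.fft.rfft(y))

# The 10th harmonic (600 Hz) lies above Nyquist and folds back to 400 Hz.
alias_hz = folded_frequency(10 * f0, fs)
print("alias of 600 Hz:", alias_hz, "Hz, magnitude:", spectrum[int(alias_hz)])

# 50 Hz is neither a harmonic of 60 Hz nor one of its aliases, so it stays near zero.
print("magnitude at 50 Hz:", spectrum[50])
```

The energy at 400 Hz is indistinguishable from a genuine 400 Hz component, which is the ambiguity that later layers of a time-domain vocoder would have to compensate for.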

Paper Structure

This paper contains 35 sections, 32 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Frequency-domain representations of each signal in the anti-aliased nonlinear operation (BigVGAN). The output signal $f(\bm{x})$, shown on the right, is obtained through the process described in Eq. (\ref{eq: anf2}). The red lines indicate the Nyquist frequencies.
  • Figure 2: Amplitude spectra in dB of $\hat{\bm{x}}$ (blue), $\hat{\bm{a}}$ (green), and $\hat{\bm{x}} \odot \hat{\bm{a}}$ (red) are shown for a 60 Hz sinusoidal signal $\bm{x}$ when the ReLU function is applied. The sampling frequency of the original signal $\bm{x}$ is 1 kHz. The dotted lines mark the expected harmonic frequencies from Eq. (\ref{eq: rectified sine}). Despite the use of the anti-aliased nonlinear operation (BigVGAN), aliasing artifacts are still clearly observed in the green and red spectra.
  • Figure 3: This diagram provides an overview of Wavehax. The kernel width of the 1D convolution is set to 7, while the kernel size of the depthwise convolution is set to 7 $\times$ 7. The numbers of hidden channels, denoted as $C$ and $C'$, are set to 32 and 64, respectively. The number of frequency bins, $F$, is set to 241, calculated as half of the discrete Fourier transform points plus one. $T$ and $N$ represent the number of time steps in the waveforms and time frames in the features, respectively.
  • Figure 4: Log amplitude spectra of $\bm{x}^k$ for $k$ up to 6, where $\bm{x}$ is a 60 Hz sinusoidal signal, are shown. The sampling frequency is 1 kHz, with the discrete Fourier transform over a duration of 1 second.
  • Figure 5: This figure shows the multi-resolution STFT distances averaged for each $\text{F}_0$ bin, grouped by 30 Hz intervals. Models marked with an asterisk (*) are equipped with the anti-aliased nonlinear operation (BigVGAN). 'w/ Sine' and 'w/ Har.' indicate models enhanced by the sinusoidal or harmonic priors, respectively. The STFT frameshift is fixed at 10 ms across all resolutions to match the temporal resolution of the $\text{F}_0$ sequences.
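
The setting in the Figure 4 caption (powers $\bm{x}^k$ of a 60 Hz sinusoid, 1 kHz sampling, 1-second transform) can be reproduced approximately with the sketch below. It is an illustration under stated assumptions rather than the authors' code: the helper `highest_component_hz` and its relative threshold are hypothetical. Raising a sinusoid to the $k$-th power produces components no higher than $k$ times the fundamental, so for $k$ up to 6 every component stays below the 500 Hz Nyquist frequency.

```python
import numpy as np

fs = 1000                                # sampling frequency in Hz (Figure 4 caption)
f0 = 60                                  # sinusoid frequency in Hz (Figure 4 caption)
t = np.arange(fs) / fs                   # 1 s window, so rfft bin k corresponds to k Hz
x = np.sin(2 * np.pi * f0 * t)

def highest_component_hz(signal, rel_threshold=1e-8):
    """Highest rfft bin whose magnitude exceeds a (hypothetical) relative threshold."""
    mag = np.abs(np.fft.rfft(signal))
    significant = np.nonzero(mag > rel_threshold * mag.max())[0]
    return int(significant[-1])

# Highest frequency present in x**k for k = 1..6.
top = {k: highest_component_hz(x ** k) for k in range(1, 7)}
for k, f_hz in top.items():
    print(f"k={k}: highest component at {f_hz} Hz")
```

Each `top[k]` should equal `60 * k` Hz, with 360 Hz for `k = 6` still below Nyquist. For comparison, a ReLU applied to the same sinusoid produces an unbounded series of even harmonics, which is why it aliases at this sampling rate while low-order powers do not.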