Wavehax: Aliasing-Free Neural Waveform Synthesis Based on 2D Convolution and Harmonic Prior for Reliable Complex Spectrogram Estimation

Reo Yoneyama; Atsushi Miyashita; Ryuichi Yamamoto; Tomoki Toda

TL;DR

This work identifies aliasing as a fundamental bottleneck in neural vocoders, especially under high-F0 extrapolation. It proposes Wavehax, a GAN-based vocoder that achieves aliasing-free waveform synthesis by estimating complex spectrograms from a harmonic-prior-informed 2D time–frequency representation, using 2D convolutions and ConvNeXt blocks, and reconstructing the waveform with the inverse short-time Fourier transform (iSTFT). The harmonic prior provides a strong inductive bias for generating harmonic content, so the model reaches high quality at far lower cost than state-of-the-art time-domain vocoders and is more robust to unseen F0 values. Experiments on the JVS dataset show that Wavehax achieves competitive perceptual quality with less than 5% of the parameters and multiply-accumulate operations (MACs) of HiFi-GAN V1, and remains resilient in high-F0 scenarios where aliasing typically degrades performance.
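To make the idea of a harmonic prior concrete, the sketch below builds a band-limited harmonic signal from a single F0 value by summing only the sinusoids whose frequencies lie strictly below the Nyquist frequency. This is an illustrative assumption about what such a prior could look like, not the construction used in the paper: the function names, the constant F0, and the equal-amplitude weighting are all hypothetical choices made for this example.

```python
import math


def num_harmonics_below_nyquist(f0_hz, sample_rate_hz):
    """Count harmonics k * f0 that lie strictly below the Nyquist frequency."""
    if f0_hz <= 0:  # treat non-positive F0 as unvoiced: no harmonics
        return 0
    nyquist = sample_rate_hz / 2
    k = int(nyquist // f0_hz)
    if k * f0_hz >= nyquist:  # exclude a harmonic landing exactly on Nyquist
        k -= 1
    return max(k, 0)


def harmonic_prior(f0_hz, sample_rate_hz, num_samples):
    """Hypothetical band-limited prior: equal-weight sum of harmonics of f0.

    Assumes a constant F0; a time-varying F0 would need phase accumulation.
    """
    count = num_harmonics_below_nyquist(f0_hz, sample_rate_hz)
    if count == 0:
        return [0.0] * num_samples
    signal = []
    for n in range(num_samples):
        t = n / sample_rate_hz
        value = sum(math.sin(2 * math.pi * k * f0_hz * t) for k in range(1, count + 1))
        signal.append(value / count)  # normalise so that |value| <= 1
    return signal


prior = harmonic_prior(f0_hz=300, sample_rate_hz=8000, num_samples=256)
```

Because no harmonic above Nyquist is ever generated, the signal contains no folded components, whatever the F0. A short-time Fourier transform of such a signal would give the kind of 2D time–frequency input that the summary describes.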

Abstract

Neural vocoders often struggle with aliasing in latent feature spaces, caused by time-domain nonlinear operations and resampling layers. Aliasing folds high-frequency components into the low-frequency range, making aliased and original frequency components indistinguishable and introducing two practical issues. First, aliasing complicates the waveform generation process, as the subsequent layers must address these aliasing effects, increasing the computational complexity. Second, it limits extrapolation performance, particularly in handling high fundamental frequencies, which degrades the perceptual quality of generated speech waveforms. This paper demonstrates that 1) time-domain nonlinear operations inevitably introduce aliasing but provide a strong inductive bias for harmonic generation, and 2) time-frequency-domain processing can achieve aliasing-free waveform synthesis but lacks the inductive bias for effective harmonic generation. Building on this insight, we propose Wavehax, an aliasing-free neural WAVEform generator that integrates 2D convolution and a HArmonic prior for reliable Complex Spectrogram estimation. Experimental results show that Wavehax achieves speech quality comparable to existing high-fidelity neural vocoders and exhibits exceptional robustness in scenarios requiring high fundamental frequency extrapolation, where aliasing effects typically become severe. Moreover, Wavehax requires less than 5% of the multiply-accumulate operations and model parameters of HiFi-GAN V1, while running over four times faster on CPU.
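The folding effect described above can be reproduced with a few lines of standard-library Python. The sketch below is only an illustration: squaring stands in for a generic time-domain nonlinearity, and the sample rate, tone frequency, and helper names are assumptions chosen so that the numbers come out exactly. Squaring a 3000 Hz sine creates a component at 6000 Hz, which exceeds the 4000 Hz Nyquist limit of an 8000 Hz sample rate and therefore shows up at 2000 Hz.

```python
import cmath
import math


def fold_frequency(freq_hz, sample_rate_hz):
    """Frequency in [0, fs/2] at which a sampled real sinusoid of freq_hz appears."""
    f = freq_hz % sample_rate_hz
    return min(f, sample_rate_hz - f)


def dft_magnitudes(signal):
    """Naive O(N^2) DFT magnitudes (scaled by 1/N) for bins 0..N/2."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal))
        mags.append(abs(acc) / n)
    return mags


SAMPLE_RATE = 8000   # Hz, assumed for illustration
NUM_SAMPLES = 400    # gives a 20 Hz bin spacing
F0 = 3000            # Hz, below the 4000 Hz Nyquist frequency

sine = [math.sin(2 * math.pi * F0 * i / SAMPLE_RATE) for i in range(NUM_SAMPLES)]
squared = [x * x for x in sine]  # nonlinearity: sin^2 = 0.5 - 0.5 * cos(2 * w * t)

mags = dft_magnitudes(squared)
bin_hz = SAMPLE_RATE / NUM_SAMPLES
peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])  # skip the DC bin
peak_hz = peak_bin * bin_hz

predicted_alias_hz = fold_frequency(2 * F0, SAMPLE_RATE)
```

Here `peak_hz` and `predicted_alias_hz` agree: the strongest non-DC component of the squared signal sits where the folding rule predicts, and from the samples alone it cannot be told apart from a genuine 2000 Hz tone. This is the ambiguity that later layers of a time-domain vocoder would have to resolve.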
