BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation
Hui-Peng Du, Ye-Xin Lu, Yang Ai, Zhen-Hua Ling
TL;DR
BiVocoder is a bidirectional vocoder that performs both STFT-domain feature extraction and waveform reconstruction using a symmetric ConvNeXt V2-based architecture trained with adversarial and spectral losses. It produces long-frame-shift, low-dimensional features that preserve amplitude and phase information, which yields high-quality analysis-synthesis results and gives TTS acoustic models targets they can predict directly. On VCTK and LJSpeech, BiVocoder demonstrates superior synthesis quality, competitive or better MOS in TTS, and significantly faster generation on CPU, which makes it practical for resource-constrained deployments. The approach advances bidirectional vocoding by explicitly optimizing phase information and by supporting efficient speech generation across multiple tasks.
Abstract
This paper proposes a novel bidirectional neural vocoder, named BiVocoder, capable of both feature extraction and reverse waveform generation within the short-time Fourier transform (STFT) domain. For feature extraction, the BiVocoder takes the amplitude and phase spectra derived from the STFT as inputs and transforms them into long-frame-shift, low-dimensional features through convolutional neural networks. The extracted features are shown to be suitable for direct prediction by acoustic models, which supports their application in text-to-speech (TTS) tasks. For waveform generation, the BiVocoder restores the amplitude and phase spectra from the features with a symmetric network and then applies the inverse STFT to reconstruct the speech waveform. Experimental results show that the proposed BiVocoder outperforms several baseline vocoders when synthesized speech quality and inference speed are considered together, for both analysis-synthesis and TTS tasks.
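
The STFT-domain interface described in the abstract can be illustrated with a minimal sketch: a waveform is split into amplitude and phase spectra, and the same two spectra are turned back into a waveform by the inverse STFT with overlap-add. This is not the BiVocoder implementation and contains no neural networks. The function names, the Hann window, the tiny frame length (`n_fft=16`), the hop size (`hop=4`), and the use of the full two-sided spectrum are all assumptions chosen to keep the example short and dependency-free.

```python
import cmath
import math


def hann(n_fft):
    # Periodic Hann window (assumed window choice for this sketch).
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]


def dft(frame):
    # Naive O(N^2) discrete Fourier transform; fine for tiny frames.
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    # Naive inverse DFT; the real part is kept because the input was real.
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def stft_amp_phase(signal, n_fft=16, hop=4):
    # Analysis side: windowed frames -> amplitude and phase spectra.
    win = hann(n_fft)
    amps, phases = [], []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + i] * win[i] for i in range(n_fft)]
        spec = dft(frame)
        amps.append([abs(c) for c in spec])
        phases.append([cmath.phase(c) for c in spec])
    return amps, phases


def istft_from_amp_phase(amps, phases, n_fft=16, hop=4):
    # Synthesis side: rebuild complex spectra, invert each frame, and
    # overlap-add with window-squared normalization.
    win = hann(n_fft)
    length = n_fft + hop * (len(amps) - 1)
    out = [0.0] * length
    norm = [0.0] * length
    for f, (a_row, p_row) in enumerate(zip(amps, phases)):
        spec = [cmath.rect(a, p) for a, p in zip(a_row, p_row)]
        frame = idft(spec)
        for i in range(n_fft):
            out[f * hop + i] += frame[i] * win[i]
            norm[f * hop + i] += win[i] ** 2
    # Samples with no window coverage (the very first one here) are set to 0.
    return [o / n if n > 1e-8 else 0.0 for o, n in zip(out, norm)]


# Usage: a toy two-tone signal goes through analysis and synthesis.
signal = [math.sin(2 * math.pi * 3 * n / 64) + 0.5 * math.cos(2 * math.pi * 7 * n / 64)
          for n in range(64)]
amps, phases = stft_amp_phase(signal)
recon = istft_from_amp_phase(amps, phases)
interior_error = max(abs(recon[i] - signal[i]) for i in range(16, 48))
```

When the amplitude and phase spectra are passed through unchanged, the interior of the signal is recovered up to floating-point error. In the setting the abstract describes, the learned feature extractor and its symmetric counterpart would sit between these two functions, so reconstruction quality depends on how well both spectra, including phase, are restored from the compact features.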
