BiVocoder: A Bidirectional Neural Vocoder Integrating Feature Extraction and Waveform Generation

Hui-Peng Du, Ye-Xin Lu, Yang Ai, Zhen-Hua Ling

TL;DR

BiVocoder introduces a true bidirectional vocoder that performs both STFT-domain feature extraction and waveform reconstruction via a symmetric ConvNeXt V2-based architecture trained with adversarial and spectral losses. By producing long-frame-shift, low-dimensional features that preserve amplitude and phase, it yields high-quality analysis-synthesis results and acoustically friendly targets for TTS models. Across VCTK and LJSpeech, BiVocoder demonstrates superior synthesis quality, competitive or better MOS in TTS, and significantly faster generation on CPU, highlighting its practicality for resource-constrained deployments. The approach advances bidirectional vocoding by explicitly optimizing phase information and enabling efficient, generalizable speech generation in multiple tasks.

Abstract

This paper proposes a novel bidirectional neural vocoder, named BiVocoder, capable of both feature extraction and reverse waveform generation within the short-time Fourier transform (STFT) domain. For feature extraction, BiVocoder takes the amplitude and phase spectra derived from the STFT as inputs and transforms them into long-frame-shift, low-dimensional features through convolutional neural networks. The extracted features are shown to be suitable for direct prediction by acoustic models, supporting its application in text-to-speech (TTS) tasks. For waveform generation, BiVocoder restores the amplitude and phase spectra from the features with a symmetric network, followed by an inverse STFT to reconstruct the speech waveform. Experimental results show that the proposed BiVocoder achieves better performance than some baseline vocoders when both synthesized speech quality and inference speed are considered, for both analysis-synthesis and TTS tasks.
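The STFT-domain round trip that the abstract describes (waveform to amplitude and phase spectra, then back through an inverse STFT) can be illustrated with a minimal sketch. This is not the BiVocoder network: the learned encoder and decoder are left out entirely, so the spectra are passed through unchanged. The frame length, hop size, Hann window, and all function names are assumptions chosen for illustration.

```python
import cmath
import math


def hann(n):
    # Periodic Hann window; with hop = n // 2 the shifted copies sum to 1.
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def dft(frame):
    # Naive O(n^2) discrete Fourier transform of one frame.
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    # Naive inverse DFT; the imaginary part is numerical noise for real input.
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def analyze(signal, n_fft=16, hop=8):
    """Split a waveform into per-frame amplitude and phase spectra."""
    win = hann(n_fft)
    amps, phases = [], []
    for start in range(0, len(signal) - n_fft + 1, hop):
        spec = dft([signal[start + i] * win[i] for i in range(n_fft)])
        amps.append([abs(c) for c in spec])           # amplitude spectrum
        phases.append([cmath.phase(c) for c in spec])  # phase in (-pi, pi]
    return amps, phases


def synthesize(amps, phases, n_fft=16, hop=8):
    """Recombine amplitude and phase, invert each frame, and overlap-add."""
    out = [0.0] * (hop * (len(amps) - 1) + n_fft)
    for f, (amp, pha) in enumerate(zip(amps, phases)):
        frame = idft([cmath.rect(a, p) for a, p in zip(amp, pha)])
        for i in range(n_fft):
            out[f * hop + i] += frame[i]
    return out


# Toy input: two sinusoids, 64 samples.
signal = [math.sin(2 * math.pi * 3 * t / 64) + 0.5 * math.cos(2 * math.pi * 7 * t / 64)
          for t in range(64)]
amps, phases = analyze(signal)
recon = synthesize(amps, phases)
```

Because both spectra are kept, the samples covered by two overlapping frames are recovered up to floating-point error; only the first and last half-frames are attenuated by the window. In a vocoder of the kind described, a feature extractor and a symmetric generator would sit between `analyze` and `synthesize`.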

Paper Structure (16 sections, 2 figures, 2 tables)

This paper contains 16 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: The architecture of BiVocoder; the discriminators are omitted from the diagram. $ABS(\cdot)$ and $Angle(\cdot)$ denote the amplitude and phase spectrum calculations. $Arctan2$ stands for the two-argument arctangent function. Conv1d and DeConv1d represent a 1D convolutional layer and a 1D deconvolutional layer, respectively.
  • Figure 2: The architecture of the ConvNeXt V2 block, where GELU and GRN represent the Gaussian error linear unit and global response normalization, respectively.
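
Two operations named in the captions above can be sketched in a few lines. The first recovers a wrapped phase value with the two-argument arctangent; treating its arguments as real-like and imaginary-like outputs is an assumption, since the caption does not say what feeds it. The second follows the global response normalization formula as published for ConvNeXt V2, adapted here to a 1D sequence of frames; the function names, the list-of-lists layout, and the `eps` value are illustrative choices, not taken from this paper.

```python
import math


def phase_from_parts(real, imag):
    # Two-argument arctangent: returns a phase in [-pi, pi] for each
    # (real, imag) pair, resolving the quadrant that a plain arctan cannot.
    return [math.atan2(i, r) for r, i in zip(real, imag)]


def grn(x, gamma, beta, eps=1e-6):
    """Global response normalization over a (frames x channels) input.

    x     : list of T frames, each a list of C channel values
    gamma : per-channel scale, length C
    beta  : per-channel bias, length C
    """
    n_ch = len(x[0])
    # Per-channel L2 norm aggregated over all frames.
    gx = [math.sqrt(sum(frame[c] ** 2 for frame in x)) for c in range(n_ch)]
    # Divide each channel norm by the mean norm across channels.
    mean_gx = sum(gx) / n_ch
    nx = [g / (mean_gx + eps) for g in gx]
    # Calibrate the input and add the residual connection.
    return [[gamma[c] * frame[c] * nx[c] + beta[c] + frame[c] for c in range(n_ch)]
            for frame in x]


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
identity = grn(x, gamma=[0.0, 0.0], beta=[0.0, 0.0])
scaled = grn(x, gamma=[1.0, 1.0], beta=[0.0, 0.0])
angles = phase_from_parts([1.0, 0.0, -1.0], [0.0, 1.0, 0.0])
```

With `gamma` and `beta` set to zero the layer reduces to the identity, which is why zero initialization of these parameters leaves the block's behaviour unchanged at the start of training.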