Wave-Trainer-Fit: Neural Vocoder with Trainable Prior and Fixed-Point Iteration towards High-Quality Speech Generation from SSL features

Hien Ohnaka, Yuma Shirahata, Masaya Kawamura

TL;DR

WaveTrainerFit advances neural vocoding for SSL features by introducing a VAE-based trainable prior and a posterior-guided energy constraint within a fixed-point diffusion-style framework. By sampling noise in the time-frequency domain and enforcing reference-aware gain, the method achieves higher fidelity and speaker similarity with fewer inference steps than WaveFit. Experiments on LibriTTS-R across multiple SSL layers demonstrate robust improvements in objective and subjective metrics, including resilience to the depth of SSL features. The approach maintains a compact model size and demonstrates practical potential for high-quality, SSL-conditioned speech synthesis with efficient inference. Code and models are publicly available to facilitate reproducibility and deployment.

Abstract

We propose WaveTrainerFit, a neural vocoder that performs high-quality waveform generation from data-driven features such as SSL features. WaveTrainerFit builds upon the WaveFit vocoder, which integrates a diffusion model with a generative adversarial network. The proposed method adds two key improvements: (1) by introducing a trainable prior, the inference process starts from noise close to the target speech instead of Gaussian noise; (2) reference-aware gain adjustment is performed by constraining the trainable prior to match the speech energy. These improvements are expected to reduce the complexity of waveform modeling from data-driven features, enabling high-quality waveform generation with fewer inference steps. Through experiments, we showed that WaveTrainerFit can generate highly natural waveforms with improved speaker similarity from data-driven features, while requiring fewer iterations than WaveFit. Moreover, we showed that the proposed method works robustly with respect to the depth at which SSL features are extracted. Code and pre-trained models are available from https://github.com/line/WaveTrainerFit.
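The two ideas in the abstract, starting from a learned prior rather than plain Gaussian noise and rescaling each iterate to a reference energy, can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`sample_from_prior`, `adjust_gain`, `refine`, `toy_denoise`) is hypothetical, the prior parameters are hand-picked arrays rather than encoder outputs, the denoiser is a fixed placeholder for the neural network, and sampling is done in the time domain for brevity even though the TL;DR describes sampling in the time-frequency domain.

```python
import numpy as np


def sample_from_prior(mean, log_std, rng):
    """Draw one sample from a diagonal Gaussian N(mean, exp(log_std)**2)."""
    return mean + np.exp(log_std) * rng.standard_normal(mean.shape)


def adjust_gain(signal, target_power, eps=1e-8):
    """Rescale `signal` so that its mean power matches `target_power`."""
    power = np.mean(signal ** 2)
    return signal * np.sqrt(target_power / (power + eps))


def refine(y0, denoise, target_power, num_steps):
    """Alternate a denoising update with gain adjustment for `num_steps` rounds."""
    y = y0
    for _ in range(num_steps):
        y = adjust_gain(denoise(y), target_power)
    return y


# Toy setup (assumption): a short sine wave stands in for the target speech.
rng = np.random.default_rng(0)
t = np.arange(2400) / 24000.0
target = 0.1 * np.sin(2 * np.pi * 220.0 * t)
target_power = float(np.mean(target ** 2))

# Hypothetical prior parameters; in the paper they are predicted by an encoder.
prior_mean = 0.8 * target
prior_log_std = np.full(target.shape, -4.0)


def toy_denoise(y):
    """Placeholder for the neural denoiser: move halfway towards the target."""
    return 0.5 * (y + target)


y0 = sample_from_prior(prior_mean, prior_log_std, rng)
y = refine(y0, toy_denoise, target_power, num_steps=3)
```

In this toy setting the initial sample `y0` is already close to the target, so a few refinement rounds suffice; the gain step keeps the mean power of every iterate pinned to the reference value.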

Paper Structure (16 sections, 8 equations, 3 figures, 2 tables)


Figures (3)

  • Figure 1: Conceptual diagrams and noise examples of each method. The bottom images show the log Mel-spectrograms of the initial noise.
  • Figure 2: Overview of the proposed model. During training, the posterior encoder derived from the target waveform and the SSL feature is used for noise sampling and gain adjustment. During inference, the prior encoder derived from the SSL feature is used for the same process. Solid arrows are used for both training and inference.
  • Figure 3: Line plots of objective metrics over iterations. Real-Time Factor (RTF) was measured on "Intel(R) Xeon(R) Silver 4316 CPU @ 2.30GHz" using 200 randomly selected samples.
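
For readers unfamiliar with the Real-Time Factor mentioned in the Figure 3 caption, it is commonly defined as processing time divided by the duration of the generated audio, so values below 1 mean faster than real time. The sketch below shows one way to measure it; the function names, the stand-in synthesizer, and the 24 kHz sample rate are illustrative assumptions, not the authors' measurement script.

```python
import time


def rtf(elapsed_seconds, num_samples, sample_rate):
    """Real-Time Factor: processing time divided by audio duration."""
    return elapsed_seconds / (num_samples / sample_rate)


def measure_rtf(synthesize, num_samples, sample_rate):
    """Time one call to `synthesize(num_samples)` and return its RTF."""
    start = time.perf_counter()
    synthesize(num_samples)
    elapsed = time.perf_counter() - start
    return rtf(elapsed, num_samples, sample_rate)


def dummy_synthesize(num_samples):
    """Stand-in for a vocoder (assumption): returns silence of the given length."""
    return [0.0] * num_samples


# One second of audio at an assumed 24 kHz sample rate.
measured = measure_rtf(dummy_synthesize, num_samples=24000, sample_rate=24000)
```

Averaging `measured` over many utterances, as the caption does with 200 samples, reduces the influence of timing noise on any single run.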