Bayesian Speech Synthesizers Can Learn from Multiple Teachers

Ziyang Zhang, Yifan Gao, Xuenan Xu, Baoxiang Li, Wen Wu, Chao Zhang

TL;DR

BELLE introduces Bayesian evidential learning for a continuous-valued autoregressive TTS framework that directly predicts mel-spectrogram frames with uncertainty, enabling principled sampling. It couples Normal-Inverse-Gamma hierarchical sampling with an autoregressive Transformer and a multi-teacher knowledge distillation strategy to leverage synthetic data from several public TTS models. Empirically, BELLE achieves competitive speech naturalness and speaker similarity with roughly one-tenth the real-data size of strong baselines, and its streaming variant BELLE-stream demonstrates low-latency generation without major quality loss. The work highlights the benefits of Bayesian sampling over Gaussian sampling and shows that multi-teacher augmentation yields robust improvements, offering a practical path toward high-quality zero-shot TTS with uncertainty control.

Abstract

Codec-based text-to-speech (TTS) models have recently gained traction for their efficiency and strong performance in voice cloning. However, codec-based TTS faces limitations due to the challenges of pretraining robust speech codecs and the quality degradation introduced by quantization errors. Emerging evidence suggests that continuous-valued generative models can alleviate these issues and serve as a promising alternative. Yet, effectively modelling diverse speech patterns and developing reliable sampling strategies for continuous-valued autoregressive (AR) TTS remains underexplored. In this work, we propose BELLE, Bayesian evidential learning with language modelling for TTS, a novel continuous-valued AR framework that directly predicts mel-spectrograms from textual input. BELLE treats each mel-spectrogram frame as a Gaussian distribution sampled from a learned hyper distribution, enabling principled uncertainty estimation, particularly in scenarios with parallel data (i.e., one text-audio prompt paired with multiple speech samples). To obtain such data, diverse speech samples are synthesized using multiple pre-trained TTS models given the same text-audio prompts, which are distilled into BELLE via Bayesian evidential learning. Experimental results indicate that BELLE demonstrates highly competitive performance compared with the current best open-source TTS models, even though BELLE is trained on a large amount of synthetic data and uses only approximately one-tenth of their training data. Audio samples generated by BELLE are available at https://belletts.github.io/Belle/. The code, checkpoints, and synthetic data will be released after the paper is accepted.
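
The hierarchical sampling described in the abstract (a Gaussian per frame, itself drawn from a learned hyper distribution) can be illustrated with a minimal sketch. It assumes the hyper distribution is the Normal-Inverse-Gamma named in the TL;DR, uses the common evidential-regression parameter names `gamma`, `nu`, `alpha`, `beta` (an assumption, not confirmed by this page), and treats each mel bin independently. The function name `sample_nig_frame` and the example values are hypothetical, and this is not the authors' implementation.

```python
import math
import random


def sample_nig_frame(gamma, nu, alpha, beta, rng):
    """Draw one frame from per-bin NIG parameters (four lists of equal length).

    Parameter names follow a common evidential-regression convention and are
    an assumption here, not taken from the paper.
    """
    frame = []
    for g, n, a, b in zip(gamma, nu, alpha, beta):
        # sigma^2 ~ Inverse-Gamma(alpha, beta), drawn as the reciprocal of a
        # Gamma(shape=alpha, scale=1/beta) sample.
        sigma2 = 1.0 / rng.gammavariate(a, 1.0 / b)
        # mu ~ Normal(gamma, sigma^2 / nu)
        mu = rng.gauss(g, math.sqrt(sigma2 / n))
        # y ~ Normal(mu, sigma^2): the final Gaussian sampling step
        frame.append(rng.gauss(mu, math.sqrt(sigma2)))
    return frame


# Hypothetical parameters for a 4-bin frame.
rng = random.Random(0)
frame = sample_nig_frame([0.0] * 4, [1.0] * 4, [3.0] * 4, [1.0] * 4, rng)
print(frame)
```

Under this parameterization, a larger `nu` tightens the sampled mean around `gamma`, and a larger `alpha` (with `beta` fixed) shrinks the sampled variance, which is one way the predicted parameters could control output diversity.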

Paper Structure

This paper contains 39 sections, 21 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: The structure of BELLE and the detailed sampling module. The output is assumed to follow a Normal-Inverse-Gamma (NIG) distribution, and the Sampling Module predicts four distribution parameters. Sequentially, variance and mean are obtained via Inverse-Gamma sampling and Gaussian sampling, respectively, followed by a final Gaussian sampling step to generate the output $\boldsymbol{y}_t^{(1)}$. BELLE-stream adopts the same architecture as BELLE. The text and Mel-spectrogram are split into chunks and interleaved to form the input sequence. Typically, the final chunk generates all remaining audio, making it longer than the preceding ones.
  • Figure 2: Screenshots of subjective evaluations.
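
The chunk interleaving that the Figure 1 caption attributes to BELLE-stream can be sketched roughly as follows. The chunk sizes, the `("text", ...)` / `("mel", ...)` tagging, and the function name `interleave_chunks` are all hypothetical choices for illustration; the caption only states that text and mel-spectrogram chunks are interleaved and that the final chunk typically carries all remaining audio.

```python
def interleave_chunks(text_tokens, mel_frames, text_chunk, mel_chunk):
    """Interleave fixed-size text and mel chunks into one sequence.

    The last mel chunk takes every remaining frame, so it is usually longer
    than the earlier ones. Chunk sizes are illustrative assumptions.
    """
    # Ceiling division; always emit at least one (possibly empty) text chunk.
    n_chunks = max(1, -(-len(text_tokens) // text_chunk))
    sequence = []
    for i in range(n_chunks):
        text_part = text_tokens[i * text_chunk:(i + 1) * text_chunk]
        start = i * mel_chunk
        if i == n_chunks - 1:
            mel_part = mel_frames[start:]  # final chunk: all remaining audio
        else:
            mel_part = mel_frames[start:start + mel_chunk]
        sequence.append(("text", text_part))
        sequence.append(("mel", mel_part))
    return sequence


# Hypothetical example: 7 text tokens, 20 mel frames (integers as stand-ins).
seq = interleave_chunks(list("abcdefg"), list(range(20)), text_chunk=3, mel_chunk=4)
for kind, part in seq:
    print(kind, len(part))
```

In this toy setup the first two mel chunks hold 4 frames each and the last holds the remaining 12, mirroring the caption's remark that the final chunk is longer than the preceding ones.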