Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder

Xuyuan Li, Zengqiang Shang, Peiyang Shi, Hua Hua, Ta Li, Pengyuan Zhang

TL;DR

This work tackles the challenge of expressive paragraph-level speech synthesis for audiobooks by extending VITS with a five-level hierarchical variational autoencoder (frame to paragraph). The proposed EP-MSTTS uses separate Multi-step Audio/Text Encoders and a Multi-step Decoder to model intra-paragraph stylistic variation while mitigating posterior collapse through a staged KL-annealing strategy and a parallel linear-spectrogram predictor. Trained on paragraph-sliced French audiobook data, EP-MSTTS outperforms sentence-level and hierarchical baselines in MOS and objective metrics (MCD, log-F0 RMSE), with ablations confirming the value of each hierarchical component and the training strategy. The approach enables smoother, more coherent long-form speech with expressive variation, advancing practical audiobook synthesis and long-form TTS applications.
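
The two objective metrics named above have standard textbook definitions, which can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: it assumes the reference and synthesized features are already time-aligned and that the 0th (energy) cepstral coefficient is dropped by the caller. The function names `mel_cepstral_distortion` and `log_f0_rmse` are hypothetical.

```python
import numpy as np


def mel_cepstral_distortion(ref_mcep, syn_mcep):
    """Mean MCD in dB over time-aligned frames.

    Both inputs have shape (frames, coeffs). Assumption: the frames are
    already aligned (for example by DTW) and the 0th coefficient is excluded.
    """
    ref = np.asarray(ref_mcep, dtype=float)
    syn = np.asarray(syn_mcep, dtype=float)
    # Per-frame Euclidean distance scaled by sqrt(2), then converted to dB.
    per_frame = np.sqrt(2.0 * np.sum((ref - syn) ** 2, axis=1))
    return float((10.0 / np.log(10.0)) * np.mean(per_frame))


def log_f0_rmse(ref_f0, syn_f0):
    """RMSE of natural-log F0 over frames voiced in both signals.

    Assumption: unvoiced frames are marked with F0 == 0.
    """
    ref = np.asarray(ref_f0, dtype=float)
    syn = np.asarray(syn_f0, dtype=float)
    voiced = (ref > 0) & (syn > 0)
    if not voiced.any():
        raise ValueError("no frames are voiced in both signals")
    err = np.log(ref[voiced]) - np.log(syn[voiced])
    return float(np.sqrt(np.mean(err ** 2)))
```

Lower values are better for both metrics. Reported numbers depend on the feature extractor, the alignment method, and the log base, none of which are specified in this summary.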

Abstract

Neural networks can now generate high-quality single-sentence speech. However, audiobook speech synthesis remains challenging because of the intra-paragraph correlation of semantic and acoustic features and the variability of speaking styles. In this paper, we propose a highly expressive paragraph speech synthesis system with a multi-step variational autoencoder, called EP-MSTTS. EP-MSTTS is the first VITS-based paragraph speech synthesis model and models the variable style of paragraph speech at five levels: frame, phoneme, word, sentence, and paragraph. We also propose a series of improvements to enhance the performance of this hierarchical model. In addition, we train EP-MSTTS directly on speech sliced by paragraph rather than by sentence. Experimental results on the single-speaker French audiobook corpus released at Blizzard Challenge 2023 show that EP-MSTTS outperforms the baseline models.
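
To make the five-level hierarchy concrete, the sketch below pools frame-level features step by step up to a single paragraph-level vector. It is an illustration only: mean pooling is an assumption, since this summary does not describe the model's actual downsampling operation, and the helper `pool_by_segment` and the toy index arrays are hypothetical.

```python
import numpy as np


def pool_by_segment(features, segment_ids):
    """Average the rows of `features` that share a segment id.

    `features` has shape (items, dims); `segment_ids` has shape (items,)
    and holds integer ids from 0 to n_segments - 1. Assumption: every id
    in that range occurs at least once, so no segment is empty.
    """
    features = np.asarray(features, dtype=float)
    segment_ids = np.asarray(segment_ids, dtype=int)
    n_segments = int(segment_ids.max()) + 1
    sums = np.zeros((n_segments, features.shape[1]))
    np.add.at(sums, segment_ids, features)  # unbuffered scatter-add per segment
    counts = np.bincount(segment_ids, minlength=n_segments)[:, None]
    return sums / counts


# Toy example: 6 frames -> 3 phonemes -> 2 words -> 2 sentences -> 1 paragraph.
frames = np.arange(12, dtype=float).reshape(6, 2)
phonemes = pool_by_segment(frames, [0, 0, 1, 1, 1, 2])
words = pool_by_segment(phonemes, [0, 0, 1])
sentences = pool_by_segment(words, [0, 1])
paragraph = pool_by_segment(sentences, [0, 0])
```

Each call shortens the sequence by one level, so a latent variable at a coarser level summarizes a longer span of speech than one at a finer level.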

Paper Structure (15 sections, 4 equations, 3 figures, 2 tables)

This paper contains 15 sections, 4 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: EP-MSTTS architecture including MSAE, MSTE, and MSD. DS and US denote "Downsampling" and "Upsampling" operations. PoEnc and PriEnc denote "Posterior Encoder" and "Prior Encoder". "$\oplus$" represents concatenating two tensors.
  • Figure 2: Probability distribution of paragraph duration length.
  • Figure 3: Speech of three consecutive sentences generated by VITS-U-S (top) and VITS-U (bottom).