Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models

Weiqin Li, Peiji Yang, Yicheng Zhong, Yixuan Zhou, Zhisheng Wang, Zhiyong Wu, Xixin Wu, Helen Meng

TL;DR

This work tackles spontaneous style TTS, a domain challenged by data quality and complex prosody. It introduces an LM-based framework built on the VALL-E backbone that explicitly encodes 19 spontaneous behaviors using a syntactic-aware encoder and enriches prosody with fine-grained spontaneous representations via a Spontaneous Prosody Extractor and an LM-based Prosody Predictor. The model is pre-trained on large-scale data and fine-tuned with a three-step procedure, optimizing a joint loss that combines content and spontaneous-label objectives. Empirical results on Mandarin data show substantial gains in both prosody naturalness and spontaneous-behavior naturalness, with ablations confirming the value of explicit behavior control and prosody modeling for realistic, controllable spontaneous speech synthesis.

Abstract

Spontaneous style speech synthesis, which aims to generate human-like speech, often encounters challenges due to the scarcity of high-quality data and limitations in model capabilities. Recent language model-based TTS systems can be trained on large, diverse, and low-quality speech datasets, resulting in highly natural synthesized speech. However, they are limited by the difficulty of simulating various spontaneous behaviors and capturing prosody variations in spontaneous speech. In this paper, we propose a novel spontaneous speech synthesis system based on language models. We systematically categorize and uniformly model diverse spontaneous behaviors. Moreover, fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech. Experimental results show that our proposed method significantly outperforms the baseline methods in terms of prosody naturalness and spontaneous behavior naturalness.

Paper Structure (18 sections, 1 equation, 5 figures, 2 tables)

Figures (5)

  • Figure 1: The architecture of our proposed model. Label predictor and prosody predictor outputs are used for inference.
  • Figure 2: The architecture of the syntactic-aware spontaneous behavior encoder. $\mathrm{Index}_{a,b}$ denotes the index of $a$ in $b$, and $\mathrm{Cnt}_{a,b}$ denotes the number of $a$ in $b$. Subsentences are separated by punctuation.
  • Figure 3: The architecture of the Spontaneous Prosody Extractor.
  • Figure 4: Subjective preference test results on the preference for spontaneous style. Both are generated from the proposed method. NP represents no preference.
  • Figure 5: The mel-spectrograms and pitch contours of speech synthesized by the proposed model. The text means "um, the scenery is really beautiful." and different labels are added to "um", highlighted by the red box.
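
The Figure 2 caption describes position features of the form Index (the index of one unit inside another) and Cnt (the number of units inside another), with subsentences separated by punctuation. The sketch below is only an illustration of how such features might be computed for words within subsentences and subsentences within a sentence. The function name `position_features`, the feature keys, the punctuation set, and whitespace tokenization are all assumptions made for this example and are not taken from the paper; Mandarin text would need a proper word segmenter.

```python
import re

# Assumed punctuation set: the source only states that subsentences
# are separated by punctuation, not which marks are used.
PUNCT = re.compile(r"[,.!?;:，。！？；：]")


def position_features(sentence):
    """Return one dict per word with hypothetical Index/Cnt features."""
    # Split at punctuation, drop empty pieces, tokenize on whitespace
    # (an assumption that only suits space-delimited text).
    subsentences = [s.split() for s in PUNCT.split(sentence) if s.strip()]
    features = []
    for sub_idx, words in enumerate(subsentences):
        for word_idx, word in enumerate(words):
            features.append({
                "word": word,
                "index_word_sub": word_idx,           # word's index in its subsentence
                "cnt_word_sub": len(words),           # number of words in the subsentence
                "index_sub_sent": sub_idx,            # subsentence's index in the sentence
                "cnt_sub_sent": len(subsentences),    # number of subsentences in the sentence
            })
    return features


feats = position_features("um, the scenery is really beautiful.")
```

Here the example sentence from Figure 5 splits into two subsentences ("um" and "the scenery is really beautiful"), so each word carries both its local position and the position of its subsentence.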