SponTTS: modeling and transferring spontaneous style for TTS

Hanzhao Li, Xinfa Zhu, Liumeng Xue, Yang Song, Yunlin Chen, Lei Xie

TL;DR

SponTTS is a two-stage framework, built on neural bottleneck (BN) features, that models and transfers spontaneous speaking style for TTS. The Text2BN stage uses a CVAE, constrained by a spontaneous phenomena embedding prediction loss and paired with a flow-based prior, to capture latent spontaneous prosody from BN features. The BN2Wave stage uses a VITS-like generator to synthesize the waveform conditioned on BN features and speaker embeddings. Experiments show that the generated speech has high naturalness, expressiveness, and speaker similarity, and a zero-shot test indicates that the approach generalizes to unseen speakers. By separating style modeling from speaker-dependent waveform generation, SponTTS can produce spontaneous speech for target speakers who have no spontaneous data of their own.

Abstract

Spontaneous speaking style differs notably from other speaking styles because of various spontaneous phenomena (e.g., filled pauses, prolongation) and substantial prosody variation (e.g., diverse pitch and duration variation, occasional non-verbal speech such as a smile), which makes spontaneous style challenging to model and predict. Moreover, the scarcity of high-quality spontaneous data constrains spontaneous speech generation for speakers without spontaneous data. To address these problems, we propose SponTTS, a two-stage approach based on neural bottleneck (BN) features to model and transfer spontaneous style for TTS. In the first stage, we adopt a Conditional Variational Autoencoder (CVAE) to capture spontaneous prosody from BN features, and we incorporate spontaneous phenomena through a spontaneous phenomena embedding prediction loss. In addition, we introduce a flow-based predictor that predicts a latent spontaneous style representation from the text, which enriches the prosody and context-specific spontaneous phenomena during inference. In the second stage, we adopt a VITS-like module to transfer the spontaneous style learned in the first stage to the target speakers. Experiments demonstrate that SponTTS is effective in modeling spontaneous style and transferring it to the target speakers, generating spontaneous speech with high naturalness, expressiveness, and speaker similarity. A zero-shot spontaneous style TTS test further verifies the generalization and robustness of SponTTS in generating spontaneous speech for unseen speakers.
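
To make the first-stage objective described in the abstract more concrete, here is a minimal sketch of a CVAE-style loss with an auxiliary prediction term. It is not the paper's formulation. The function names (`gaussian_kl`, `mse`, `text2bn_loss`), the use of mean squared error for both the BN reconstruction and the phenomena embedding prediction, and the weights `kl_weight` and `phen_weight` are all assumptions made for illustration. The flow-based prior is also replaced here by a plain diagonal Gaussian so that the KL term has a closed form.

```python
import math


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions.

    q stands in for the posterior over the latent style (encoded from BN
    features); p stands in for the text-conditioned prior. A diagonal
    Gaussian prior is an assumption of this sketch, not a flow-based one.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def text2bn_loss(bn_pred, bn_target,
                 mu_q, logvar_q, mu_p, logvar_p,
                 phen_pred, phen_target,
                 kl_weight=1.0, phen_weight=1.0):
    """Hypothetical first-stage loss: reconstruction + KL + phenomena term.

    The weights and the choice of MSE are illustrative assumptions.
    """
    recon = mse(bn_pred, bn_target)                      # BN feature reconstruction
    kl = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)     # posterior vs. prior
    phen = mse(phen_pred, phen_target)                   # phenomena embedding prediction
    return recon + kl_weight * kl + phen_weight * phen


# Toy one-dimensional example with made-up numbers.
loss = text2bn_loss(
    bn_pred=[1.0, 2.0], bn_target=[1.0, 4.0],
    mu_q=[1.0], logvar_q=[0.0], mu_p=[0.0], logvar_p=[0.0],
    phen_pred=[0.0], phen_target=[1.0],
    kl_weight=2.0, phen_weight=3.0,
)
```

In this sketch the KL term pulls the text-predicted prior toward the style latent encoded from BN features, which is what would let a model of this kind sample a plausible style from text alone at inference time. The auxiliary term plays the role of the phenomena embedding constraint mentioned in the abstract.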

Paper Structure

This paper contains 12 sections, 10 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: The overview of SponTTS
  • Figure 2: The structure of Text2BN
  • Figure 3: The structure of BN2Wave