Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting

Wooseok Han; Minki Kang; Changhun Kim; Eunho Yang

TL;DR

Stable-TTS targets stable speaker-adaptive TTS when target data is limited and noisy, by leveraging a small set of high-quality prior samples. The model combines a text encoder, a prosody encoder, a timbre encoder, and a diffusion-based generator. At inference, a Prosody Language Model (PLM) predicts codes from a discretized prosody codebook using prompts drawn from prior samples, which keeps prosody stable. During fine-tuning, a prior-preservation loss prevents overfitting to the target samples. Experiments on LibriTTS, VCTK, and VoxCeleb show substantial improvements in intelligibility (lower WER), naturalness (MOS), and speaker similarity (SMOS) with limited and noisy data, which makes the method relevant to low-resource and real-world voice cloning scenarios.
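To make the idea of a discretized prosody codebook concrete, here is a minimal sketch of nearest-neighbor quantization, which is one common way to map continuous features to discrete codes. The summary above does not say how Stable-TTS builds or queries its codebook, so the function name `quantize`, the squared-distance rule, and the toy values are all assumptions for illustration.

```python
def quantize(frames, codebook):
    """Map each continuous prosody frame to the index of its nearest codebook entry.

    Assumption: nearest-neighbor lookup under squared Euclidean distance.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


# Toy 2-D codebook with three entries (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Three continuous "prosody frames", each close to one codebook entry.
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

codes = quantize(frames, codebook)
print(codes)  # [0, 1, 2]
```

A language model over prosody can then operate on integer sequences such as `codes`, with codes taken from a prior sample serving as the prompt.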

Abstract

Speaker-adaptive Text-to-Speech (TTS) synthesis has attracted considerable attention due to its broad range of applications, such as personalized voice assistant services. While several approaches have been proposed, they often exhibit high sensitivity to either the quantity or the quality of target speech samples. To address these limitations, we introduce Stable-TTS, a novel speaker-adaptive TTS framework that leverages a small subset of a high-quality pre-training dataset, referred to as prior samples. Specifically, Stable-TTS achieves prosody consistency by leveraging the high-quality prosody of prior samples, while effectively capturing the timbre of the target speaker. Additionally, it employs a prior-preservation loss during fine-tuning, which maintains the model's ability to synthesize prior samples and thereby prevents overfitting on target samples. Extensive experiments demonstrate the effectiveness of Stable-TTS even when target speech samples are limited in quantity and noisy.
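The prior-preservation idea can be sketched as a fine-tuning objective that adds a loss on prior samples to the loss on target samples. The abstract does not give the formula, so the additive form, the weight `lam`, the toy forward process, and every function name below are assumptions rather than the authors' implementation.

```python
import random


def diffusion_loss(predict_noise, batch, rng):
    """Simplified noise-prediction MSE over a batch of (features, condition) pairs."""
    total, count = 0.0, 0
    for x0, cond in batch:
        t = rng.random()  # diffusion time in [0, 1)
        noise = [rng.gauss(0.0, 1.0) for _ in x0]
        # Toy forward process (assumption): interpolate between data and noise.
        xt = [(1 - t) * x + t * n for x, n in zip(x0, noise)]
        pred = predict_noise(xt, t, cond)
        total += sum((p - n) ** 2 for p, n in zip(pred, noise))
        count += len(x0)
    return total / count


def fine_tune_loss(predict_noise, target_batch, prior_batch, lam=1.0, rng=None):
    """Target loss plus a weighted loss on prior samples (hypothetical form)."""
    if rng is None:
        rng = random.Random(0)
    l_target = diffusion_loss(predict_noise, target_batch, rng)
    l_prior = diffusion_loss(predict_noise, prior_batch, rng)
    return l_target + lam * l_prior


# Stand-in model that always predicts zero noise.
def zero_model(xt, t, cond):
    return [0.0 for _ in xt]


target_batch = [([0.5, -0.5], "target-speaker")]
prior_batch = [([1.0, 0.0], "prior-speaker")]

loss_without_prior = fine_tune_loss(zero_model, target_batch, prior_batch, lam=0.0)
loss_with_prior = fine_tune_loss(zero_model, target_batch, prior_batch, lam=1.0)
print(loss_without_prior, loss_with_prior)
```

Setting `lam=0.0` recovers plain fine-tuning on the target samples alone. Any positive `lam` penalizes the model for losing its ability to reconstruct prior samples.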

Paper Structure (11 sections, 5 equations, 4 figures, 3 tables)

Introduction
Stable-TTS
Diffusion-Based Zero-Shot TTS Models
Prosody Language Model for Prior Prosody Prompting
Prior-Preservation Loss for Fine-Tuning
Experiments
Experimental Setup
Main Results
Ablation Study
Evaluation under Limited Amounts of Target Samples
Conclusion

Figures (4)

Figure 1: Concept. Our objective is to build a speaker-adaptive TTS model that uses prosody prompting and prior preservation with prior samples (blue box) to generate a high-quality voice even when fine-tuned on noisy target samples (green box).
Figure 2: Overview of Stable-TTS. (a) During training, we use both the prosody encoder and the timbre encoder to enhance timbre and ensure prosody consistency. All representations serve as conditions for the diffusion model. (b) During inference, a Prosody Language Model (PLM) predicts the prosody code from a prompt taken from a prior sample, and a timbre vector is generated by concatenating multiple mel-spectrograms derived from the same speaker.
Figure 3: Details of Fine-tuning. We train only the diffusion model, using both the diffusion loss and the prior-preservation loss.
Figure 4: Zero-shot vs. Fine-tuning. t-SNE visualization (left) and analysis of SECS and WER over fine-tuning iterations (right).
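The caption of Figure 2(b) mentions generating a timbre vector by concatenating multiple mel-spectrograms from the same speaker. A minimal sketch of that step follows, assuming concatenation along the time axis and using mean pooling as a stand-in for the timbre encoder. The real encoder is not described here, so `concat_mels` and `mean_pool_timbre` are hypothetical helpers.

```python
def concat_mels(mels):
    """Concatenate mel-spectrograms (lists of frames) along the time axis."""
    n_mels = len(mels[0][0])
    out = []
    for mel in mels:
        for frame in mel:
            if len(frame) != n_mels:
                raise ValueError("all frames must share the same number of mel bins")
            out.append(frame)
    return out


def mean_pool_timbre(mel):
    """Stand-in timbre encoder (assumption): average each mel bin over time."""
    n_frames = len(mel)
    return [sum(frame[i] for frame in mel) / n_frames for i in range(len(mel[0]))]


utt_a = [[1.0, 2.0], [3.0, 4.0]]  # 2 frames x 2 mel bins
utt_b = [[5.0, 6.0]]              # 1 frame x 2 mel bins

timbre = mean_pool_timbre(concat_mels([utt_a, utt_b]))
print(timbre)  # [3.0, 4.0]
```

Pooling over several utterances rather than one reduces the influence of any single noisy recording on the resulting vector, which is one plausible motivation for the concatenation.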
