ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

Youngwon Choi; Jinwoo Oh; Hwayeon Kim; Hyeonyu Kim

TL;DR

ZeSTA is a simple domain-conditioned training framework that distinguishes real from synthetic speech via a lightweight domain embedding and combines it with real-data oversampling to stabilize adaptation under extremely limited target data, without modifying the base architecture.

Abstract

We investigate the use of zero-shot text-to-speech (ZS-TTS) as a data augmentation source for low-resource personalized speech synthesis. While synthetic augmentation can provide linguistically rich and phonetically diverse speech, naively mixing large amounts of synthetic speech with limited real recordings often leads to speaker similarity degradation during fine-tuning. To address this issue, we propose ZeSTA, a simple domain-conditioned training framework that distinguishes real and synthetic speech via a lightweight domain embedding, combined with real-data oversampling to stabilize adaptation under extremely limited target data, without modifying the base architecture. Experiments on LibriTTS and an in-house dataset with two ZS-TTS sources demonstrate that our approach improves speaker similarity over naive synthetic augmentation while preserving intelligibility and perceptual quality.
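
The two ingredients named in the abstract, a per-utterance domain label and real-data oversampling, can be illustrated with a minimal data-preparation sketch. Everything here is an assumption made for illustration: the names `Utterance`, `build_training_list`, and `real_repeat`, the file names, and the repeat factor do not come from the paper, and the domain embedding itself (which would consume the `domain` id inside the TTS model) is not shown.

```python
import random
from dataclasses import dataclass

# Hypothetical domain ids; a model would look these up in a small embedding table.
REAL, SYNTHETIC = 0, 1


@dataclass(frozen=True)
class Utterance:
    path: str
    text: str
    domain: int


def build_training_list(real, synthetic, real_repeat=1, seed=0):
    """Tag each utterance with its domain and repeat the real ones.

    `real` and `synthetic` are lists of (path, text) pairs.
    `real_repeat` is an assumed oversampling factor for real recordings.
    """
    items = [Utterance(p, t, REAL) for p, t in real] * real_repeat
    items += [Utterance(p, t, SYNTHETIC) for p, t in synthetic]
    random.Random(seed).shuffle(items)  # deterministic shuffle for reproducibility
    return items


# Toy data: one real recording, eight zero-shot TTS utterances.
real = [("real_000.wav", "hello there")]
synthetic = [(f"syn_{i:03d}.wav", f"sentence {i}") for i in range(8)]
train = build_training_list(real, synthetic, real_repeat=4)
```

With these toy inputs the real recording appears four times among twelve training items, so it is no longer swamped by the synthetic set, and each item carries the label a domain-conditioned model would need to tell the two sources apart.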

Paper Structure (13 sections, 2 figures, 6 tables)

Introduction
Related Works
Method
Zero-Shot Speech Synthesis for Data Augmentation
Domain-Conditioned Training
Real-Data Oversampling
Experiments
Experimental Setup
Experimental Results
Analysis of Domain Conditioning
Speaker Consistency in Synthetic Data Augmentation
Conclusion
Generative AI Use Disclosure

Figures (2)

Figure 1: ZeSTA framework: training pipeline (a) and inference pipeline (b).
Figure 2: t-SNE visualization of latent representations for an example target speaker under different synthetic augmentation settings.
