Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model

Jan Lehečka; Zdeněk Hanzlíček; Jindřich Matoušek; Daniel Tihelka

TL;DR

The paper compares zero-shot and few-shot multi-speaker TTS using a Czech SpeechT5 foundation model pre-trained from scratch on large-scale Czech speech and text data. The model is pre-trained on approximately 120,000 hours of Czech speech and 17.5 billion Czech text tokens, then fine-tuned on a large, diverse multi-speaker corpus, using x-vector speaker embeddings and a HiFi-GAN vocoder for synthesis. Two listening tests show that zero-shot synthesis underperforms few-shot synthesis in both quality and speaker similarity, while roughly one minute of target-speaker data already yields acceptable quality. Adding more data brings diminishing gains in quality, although it may still improve similarity for less monotonous voices. The work demonstrates the practical viability of zero- and few-shot TTS for Czech, discusses ethical considerations around synthetic voices of public figures, and provides the pre-trained Czech SpeechT5 foundation model for non-commercial use.

Abstract

In this paper, we experimented with the SpeechT5 model pre-trained on large-scale datasets. We pre-trained the foundation model from scratch and fine-tuned it on a large-scale robust multi-speaker text-to-speech (TTS) task. We tested the model's capabilities in zero- and few-shot scenarios. Based on two listening tests, we evaluated the quality of the synthetic audio and how closely the synthetic voices resemble real voices. Our results showed that the SpeechT5 model can generate a synthetic voice for any speaker using only one minute of the target speaker's data. We successfully demonstrated the high quality and similarity of our synthetic voices on publicly known Czech politicians and celebrities.

Paper Structure (15 sections, 3 figures, 1 table)

Introduction
SpeechT5
Pre-training
Speech Data
Text Data
Pre-training Setup
Fine-tuning
Data Preprocessing
Fine-tuning Setup
Few-Shot Speech Data Collection
Results and Discussion
Listening Tests
Conclusion
Acknowledgments
Disclosure of Interests

Figures (3)

Figure 1: Results of the quality listening test. We show the results for the original human recordings (orig) and for the recordings generated using the SpeechT5 model: the zero-shot approach (zs) and few-shot approaches using 10 seconds (10s), one minute (1m), and five minutes (5m) of training data, respectively.
Figure 2: Quality listening test: results for individual voices.
Figure 3: Results of the similarity listening test.
