Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model
Jan Lehečka, Zdeněk Hanzlíček, Jindřich Matoušek, Daniel Tihelka
TL;DR
The paper compares zero-shot and few-shot multi-speaker TTS using a Czech SpeechT5 foundation model trained from scratch on large-scale Czech speech and text data. The model is pre-trained on approximately 120,000 hours of Czech speech and 17.5 billion Czech text tokens, then fine-tuned on a large, diverse multi-speaker corpus, using x-vector speaker embeddings and a HiFi-GAN vocoder for synthesis. Two listening tests show that zero-shot synthesis falls behind few-shot synthesis in both quality and speaker similarity, while roughly 1 minute of target speaker data yields acceptable quality. Adding more data brings diminishing gains in quality but may still improve similarity for less monotonous voices. The work demonstrates the practical viability of Czech zero-/few-shot TTS, discusses ethical considerations for synthetic voices of public figures, and provides the pre-trained Czech SpeechT5 foundation model for non-commercial use.
Abstract
In this paper, we experimented with the SpeechT5 model pre-trained on large-scale datasets. We pre-trained the foundation model from scratch and fine-tuned it on a large-scale, robust multi-speaker text-to-speech (TTS) task. We tested the model's capabilities in zero-shot and few-shot scenarios. In two listening tests, we evaluated the quality of the synthetic audio and how closely the synthetic voices resemble the real voices. Our results showed that the SpeechT5 model can generate a synthetic voice for any speaker using only one minute of the target speaker's data. We successfully demonstrated the high quality and similarity of our synthetic voices on publicly known Czech politicians and celebrities.
