PRESENT: Zero-Shot Text-to-Prosody Control
Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman, Dorien Herremans
TL;DR
The paper tackles zero-shot text-to-prosody control by editing the explicit duration, pitch, and energy predictions of a FastSpeech2-based TTS model at inference time, avoiding training-time modifications. It introduces PRESENT, a framework for extracting text-derived prosodic cues, mapping non-English phonemes to ARPAbet, and applying subphoneme-level pitch to realize tonal contours such as those of Mandarin. In zero-shot language transfer, PRESENT achieves state-of-the-art CER for German, Hungarian, and Spanish compared to ZM-Text-TTS, and tone control yields substantial Mandarin intelligibility gains, though naturalness remains a limitation because the generated speech is American-accented. The work offers a practical path to rapid cross-lingual and tonal synthesis without additional data or fine-tuning, with potential applications in accented speech generation and coverage of tonal minority languages.
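To make the idea of subphoneme-level pitch concrete, the following is a minimal sketch of how a tonal contour could be spread across the frames of a single vowel. It is an illustration only and not the authors' implementation: the function name `tone_contour`, the parameters `base_pitch` and `step`, and the assumption that pitch is expressed in normalized units are all hypothetical. The tone shapes use the conventional Chao tone numerals for the four Mandarin tones (55, 35, 214, 51).

```python
# Hypothetical sketch: per-frame pitch targets for one Mandarin vowel.
# Chao tone numerals (1 = lowest, 5 = highest) for the four lexical tones.
TONE_LEVELS = {
    1: (5, 5),      # high level
    2: (3, 5),      # rising
    3: (2, 1, 4),   # dipping
    4: (5, 1),      # falling
}


def tone_contour(tone, n_frames, base_pitch=0.0, step=0.5):
    """Return n_frames pitch values that trace the given tone's contour.

    base_pitch is the pitch assigned to the mid level (3), and step is the
    pitch change per Chao level. Both are assumed, illustrative units.
    """
    levels = TONE_LEVELS[tone]
    if n_frames <= 0:
        return []
    if n_frames == 1:
        return [base_pitch + step * (levels[0] - 3)]

    contour = []
    for k in range(n_frames):
        # Position of this frame along the piecewise-linear tone shape.
        pos = k / (n_frames - 1) * (len(levels) - 1)
        lo = min(int(pos), len(levels) - 2)
        frac = pos - lo
        level = levels[lo] + frac * (levels[lo + 1] - levels[lo])
        contour.append(base_pitch + step * (level - 3))
    return contour


if __name__ == "__main__":
    for tone in (1, 2, 3, 4):
        print(tone, tone_contour(tone, 5))
```

In a FastSpeech2-style model, values like these would stand in for the single predicted pitch of a vowel, one value per frame of its predicted duration. How the paper actually derives and applies its pitch edits may differ from this sketch.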
Abstract
Current strategies for achieving fine-grained prosody control in speech synthesis entail extracting additional style embeddings or adopting more complex architectures. To enable zero-shot application of pretrained text-to-speech (TTS) models, we present PRESENT (PRosody Editing without Style Embeddings or New Training), which exploits explicit prosody prediction in FastSpeech2-based models by modifying the inference process directly. We apply our text-to-prosody framework to zero-shot language transfer using a JETS model trained exclusively on English LJSpeech data. We obtain character error rates (CER) of 12.8%, 18.7%, and 5.9% for German, Hungarian, and Spanish, respectively, improving on the previous state-of-the-art CER by more than a factor of two for all three languages. Furthermore, we allow subphoneme-level control, a first in this field. To evaluate its effectiveness, we show that PRESENT can improve the prosody of questions, and we use it to generate Mandarin, a tonal language in which vowel pitch varies at the subphoneme level. We attain 25.3% hanzi CER and 13.0% pinyin CER with the JETS model. All our code and audio samples are available online.
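For readers unfamiliar with the metric, character error rate is conventionally computed as the character-level edit distance between a reference transcript and a recognized transcript, divided by the reference length. The sketch below shows that standard definition. The function names are illustrative, and the paper's exact evaluation pipeline (the recognizer used, text normalization, and so on) is not specified here, so this should not be read as the authors' scoring code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of four.
    print(cer("abcd", "abed"))
```

Because insertions count as errors, CER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.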
