An investigation of phrase break prediction in an End-to-End TTS system
Anandaswarup Vadapalli
TL;DR
The paper addresses explicit prosody control in End-to-End TTS by integrating external phrase break prediction models. It compares a BLSTM with task-specific embeddings against a fine-tuned BERT model, both trained on LibriTTS data, and feeds the predicted breaks into a Tacotron2 + WaveRNN synthesis pipeline. In objective evaluation, the BERT model predicts phrase breaks more accurately (F1: 92.10) than the BLSTM (F1: 88.91). In subjective ABX tests, punctuating the input text with either model improves listener comprehension relative to unpunctuated synthesis, and BERT is preferred over BLSTM. The findings demonstrate the value of external phrasing for clarity and intelligibility in End-to-End TTS, with implications for storytelling and other prosody-sensitive applications. Future work should address non-pausal cues and newer generative TTS architectures.
Abstract
Purpose: This work explores the use of external phrase break prediction models to enhance listener comprehension in End-to-End Text-to-Speech (TTS) systems. Methods: The effectiveness of these models is evaluated based on listener preferences in subjective tests. Two approaches are explored: (1) a bidirectional LSTM model with task-specific embeddings trained from scratch, and (2) a pre-trained BERT model fine-tuned on phrase break prediction. Both models are trained on a multi-speaker English corpus to predict phrase break locations in text. The End-to-End TTS system used comprises a Tacotron2 model with Dynamic Convolutional Attention for mel spectrogram prediction and a WaveRNN vocoder for waveform generation. Results: The listening tests show a clear preference for text synthesized with predicted phrase breaks over text synthesized without them. Conclusion: These results confirm the value of incorporating external phrasing models within End-to-End TTS to enhance listener comprehension.
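The step that connects a phrase break predictor to the synthesizer can be illustrated with a minimal sketch. It assumes the predictor emits one boolean per word ("break after this word") and that breaks are passed to the TTS front end as commas in the input text. The function name `insert_phrase_breaks`, the label format, and the comma marker are hypothetical choices for illustration, not details confirmed by the paper.

```python
from typing import List

# Punctuation that already signals a boundary; no marker is added after these.
PUNCTUATION = (",", ".", ";", ":", "!", "?")


def insert_phrase_breaks(tokens: List[str], break_after: List[bool], marker: str = ",") -> str:
    """Return text with a marker appended to each word predicted to precede a break.

    tokens      -- words of the unpunctuated sentence
    break_after -- hypothetical predictor output, one flag per word
    marker      -- assumed break symbol understood by the TTS front end
    """
    if len(tokens) != len(break_after):
        raise ValueError("tokens and break_after must have the same length")

    last = len(tokens) - 1
    pieces = []
    for i, (token, has_break) in enumerate(zip(tokens, break_after)):
        # Skip the final word and words that already end in punctuation.
        if has_break and i != last and not token.endswith(PUNCTUATION):
            token = token + marker
        pieces.append(token)
    return " ".join(pieces)


words = "after the storm passed the villagers returned to their homes".split()
flags = [False, False, False, True, False, False, False, False, False, False]
print(insert_phrase_breaks(words, flags))
# after the storm passed, the villagers returned to their homes
```

The punctuated string would then be given to the synthesis pipeline in place of the unpunctuated input. A real system might use a dedicated break token rather than a comma.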
