An investigation of phrase break prediction in an End-to-End TTS system

Anandaswarup Vadapalli

TL;DR

The paper addresses the lack of explicit prosody control in End-to-End TTS by integrating external phrase break prediction models. It compares two models trained on LibriTTS data, a BLSTM with task-specific embeddings and a fine-tuned BERT model, and incorporates their predicted breaks into a Tacotron2 + WaveRNN synthesis pipeline. In objective evaluation, the BERT model predicts phrase breaks more accurately (F1: 92.10) than the BLSTM (F1: 88.91). Subjective ABX tests indicate that synthesizing text punctuated by either model improves listener comprehension relative to synthesizing unpunctuated text, with BERT preferred over BLSTM. The findings demonstrate the value of external phrasing in End-to-End TTS for clarity and intelligibility, with implications for storytelling and other prosody-sensitive applications. Future work should address non-pausal cues and newer generative TTS architectures.

Abstract

Purpose: This work explores the use of external phrase break prediction models to enhance listener comprehension in End-to-End Text-to-Speech (TTS) systems. Methods: The effectiveness of these models is evaluated based on listener preferences in subjective tests. Two approaches are explored: (1) a bidirectional LSTM model with task-specific embeddings trained from scratch, and (2) a pre-trained BERT model fine-tuned on phrase break prediction. Both models are trained on a multi-speaker English corpus to predict phrase break locations in text. The End-to-End TTS system used comprises a Tacotron2 model with Dynamic Convolutional Attention for mel spectrogram prediction and a WaveRNN vocoder for waveform generation. Results: The listening tests show a clear preference for text synthesized with predicted phrase breaks over text synthesized without them. Conclusion: These results confirm the value of incorporating external phrasing models within End-to-End TTS to enhance listener comprehension.

Paper Structure (17 sections, 5 figures, 4 tables)

This paper contains 17 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: BLSTM token classification model using task-specific static embeddings trained from scratch. The inputs to the model are word embeddings, which are randomly initialized and trained jointly with the model on the phrase break prediction task. The outputs are probabilities from a softmax layer over the set of possible tags (B or NB).
  • Figure 2: BERT model with a token classification head. The BERT model was pre-trained on uncased English text and was later fine-tuned on phrase break prediction. The classification layer was randomly initialized and its parameters were learnt from scratch.
  • Figure 3: Architecture of the Tacotron2 model. The Tacotron2 model takes text as input and predicts a sequence of mel spectrogram frames as output.
  • Figure 4: Architecture of the WaveRNN model. The WaveRNN model takes the mel spectrogram predicted by the Tacotron2 model as input and generates a waveform as output.
  • Figure 5: Screenshot of the ABX test interface.
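
Both models in Figures 1 and 2 assign one tag, B or NB, to each token. The short Python sketch below illustrates how such tags might be turned into punctuated text for synthesis, and how an F1 score over the tags could be computed. It rests on assumptions that are not stated in this summary: that a B tag marks a break after the tagged word, that a break is rendered as a comma, and that F1 is measured on the B class. The function names `insert_breaks` and `break_f1` and the example sentence are hypothetical.

```python
from typing import List

BREAK, NO_BREAK = "B", "NB"  # tag set named in the Figure 1 caption


def insert_breaks(tokens: List[str], tags: List[str]) -> str:
    """Join tokens, adding a comma after each token tagged B.

    Assumption: B means "a phrase break follows this word", and a comma
    is used to signal the break to the TTS front end.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    last = len(tokens) - 1
    pieces = []
    for i, (token, tag) in enumerate(zip(tokens, tags)):
        # No comma after the final token: the sentence end is already a boundary.
        pieces.append(token + "," if tag == BREAK and i != last else token)
    return " ".join(pieces)


def break_f1(gold: List[str], pred: List[str]) -> float:
    """F1 for the B class over aligned gold and predicted tags (assumed metric)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(g == BREAK and p == BREAK for g, p in pairs)
    fp = sum(g != BREAK and p == BREAK for g, p in pairs)
    fn = sum(g == BREAK and p != BREAK for g, p in pairs)
    if tp == 0:
        return 0.0  # avoids division by zero when no break is correctly predicted
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one true break after "down", one spurious predicted break.
tokens = "when the sun went down we walked back to the village".split()
gold = [NO_BREAK] * 4 + [BREAK] + [NO_BREAK] * 6
pred = [NO_BREAK] * 4 + [BREAK] + [NO_BREAK] * 2 + [BREAK] + [NO_BREAK] * 3

punctuated = insert_breaks(tokens, gold)
score = break_f1(gold, pred)
```

Here `punctuated` is "when the sun went down, we walked back to the village", and `score` is 2/3 because the prediction has precision 0.5 and recall 1.0 on the B class.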