Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment

Paarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, Jason Li, Rafael Valle, Rohan Badlani, Boris Ginsburg

TL;DR

This work tackles robustness issues in LLM-based text-to-speech (TTS), where challenging text can produce misaligned and repetitive speech. It introduces a monotonic alignment learning framework that guides cross-attention in an encoder-decoder T5-based TTS model using a static attention prior and a CTC-loss-based alignment objective, without adding new learnable parameters. The approach is applied to multi-codebook acoustic representations (EnCodec, DAC, and spectral codecs) and evaluated on seen and unseen speakers. It improves intelligibility (lower CER/WER) and naturalness (MOS), with the decoder-context variant performing best for unseen speakers. The results indicate that enforcing monotonic text-speech alignment makes LLM-based TTS more reliable across challenging inputs and speakers, with spectral codecs offering practical benefits for parallel codebook prediction.

Abstract

Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers. However, LLM-based TTS models are not robust: the generated output can contain repeated words, missing words, and misaligned speech (referred to as hallucinations or attention errors), especially when the text contains multiple occurrences of the same token. We examine these challenges in an encoder-decoder transformer model and find that certain cross-attention heads in such models implicitly learn the alignment between text and speech when trained to predict speech tokens for a given text. To make the alignment more robust, we propose techniques that use CTC loss and attention priors to encourage monotonic cross-attention over the text tokens. Our guided attention training technique does not introduce any new learnable parameters and significantly improves the robustness of LLM-based TTS models.

Paper Structure (13 sections, 7 equations, 1 figure, 3 tables)


Figures (1)

  • Figure 1: Model Overview. (Left) The T5-TTS model takes text tokens and the acoustic codes of a reference audio as input and predicts the acoustic codes of the target audio. The figure shows both options for where the context input is placed. (Right) The cross-attention scores implicitly learn the text and speech alignment, but can be guided toward a more robust alignment with an attention prior and an alignment loss $L_{\text{align}}$.
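
To make the idea of an alignment loss on cross-attention scores more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it uses a simplified forward-sum objective without the CTC blank symbol, and the function names `logsumexp` and `monotonic_alignment_loss` as well as the input layout `attn[t][n]` are assumptions made for illustration. The loss is the negative log of the total probability mass that the attention matrix assigns to monotonic paths, that is, paths that start on the first text token, end on the last one, and at each speech step either stay on the current token or advance by exactly one.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def monotonic_alignment_loss(attn, eps=1e-12):
    """Negative log-likelihood of all monotonic paths through `attn`.

    attn[t][n] is the (hypothetical) cross-attention weight that speech
    step t places on text token n; each row is assumed to sum to 1.
    Simplification: no blank symbol, unlike a full CTC loss.
    """
    num_steps, num_tokens = len(attn), len(attn[0])
    if num_steps < num_tokens:
        raise ValueError("need at least one speech step per text token")

    log_attn = [[math.log(max(p, eps)) for p in row] for row in attn]

    # alpha[n]: log-probability of all monotonic paths that sit on
    # token n at the current speech step. Paths must start on token 0.
    alpha = [-math.inf] * num_tokens
    alpha[0] = log_attn[0][0]

    for t in range(1, num_steps):
        new_alpha = []
        for n in range(num_tokens):
            stay = alpha[n]
            advance = alpha[n - 1] if n > 0 else -math.inf
            new_alpha.append(logsumexp([stay, advance]) + log_attn[t][n])
        alpha = new_alpha

    # Paths must end on the last text token.
    return -alpha[num_tokens - 1]


if __name__ == "__main__":
    diagonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    reversed_attn = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
    print("diagonal loss:", monotonic_alignment_loss(diagonal))
    print("reversed loss:", monotonic_alignment_loss(reversed_attn))
```

A perfectly diagonal attention matrix receives a loss near zero, while an anti-diagonal one is heavily penalized, which is the behavior a monotonicity-encouraging objective should have. In a real training setup this quantity would be computed on differentiable tensors (for example with a framework's built-in CTC loss) and added to the token-prediction loss.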