Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech

Eric Battenberg; RJ Skerry-Ryan; Daisy Stanton; Soroosh Mariooryad; Matt Shannon; Julian Salazar; David Kao

TL;DR

This work addresses robustness and length generalization in autoregressive (AR) Transformer-based TTS by introducing Very Attentive Tacotron (VAT), which augments cross-attention with alignment-informed, interpolated relative position biases and a learned monotonic alignment layer. Because VAT keeps multi-head self- and cross-attention and adds only a monotonic alignment signal, it achieves near-unbounded length generalization and eliminates common AR-TTS failures such as dropped and repeated words, while preserving naturalness comparable to a strong T5-based baseline. The system uses VQ-VAE spectrogram discretization, a GAN-based vocoder, and per-speaker embeddings for multi-speaker synthesis. On two English datasets, VAT shows robust length generalization, improved robustness as measured with ASR, and resilience to inputs containing repeated words.

Abstract

Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backpropagation and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it is still able to benefit from the flexible modeling power of interleaved multi-head self- and cross-attention operations. A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system, while eliminating problems with repeated or dropped words and enabling generalization to any practical utterance length.
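
To make the idea of giving cross-attention relative location information concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single attention head, a real-valued alignment position supplied by the caller (in the paper this position is learned as a latent property of the model), and a hypothetical table of learned biases indexed by integer offset. The names `interpolated_relative_bias`, `biased_cross_attention`, `bias_table`, and `max_dist` are illustrative assumptions.

```python
import numpy as np


def interpolated_relative_bias(alignment_pos, num_keys, bias_table, max_dist):
    """Return one bias per key position j from its offset (j - alignment_pos).

    bias_table holds 2 * max_dist + 1 values, one per integer offset in
    [-max_dist, max_dist]. Fractional offsets are linearly interpolated and
    offsets outside the range are clipped to the nearest end of the table.
    (Hypothetical layout, chosen for illustration only.)
    """
    offsets = np.arange(num_keys) - alignment_pos
    offsets = np.clip(offsets, -max_dist, max_dist)
    grid = np.arange(-max_dist, max_dist + 1)
    return np.interp(offsets, grid, bias_table)


def biased_cross_attention(query, keys, values, alignment_pos, bias_table, max_dist):
    """Single-head cross-attention with an alignment-relative bias on the logits.

    query: (d,), keys: (T, d), values: (T, d_v). Returns (context, weights).
    """
    logits = keys @ query / np.sqrt(query.shape[-1])
    logits = logits + interpolated_relative_bias(
        alignment_pos, keys.shape[0], bias_table, max_dist
    )
    # Numerically stable softmax over the T key positions.
    weights = np.exp(logits - logits.max())
    weights = weights / weights.sum()
    return weights @ values, weights
```

Because the bias depends only on the offset from the alignment position, and not on the absolute key index, the same table applies however long the input text is. In this sketch, that property is what allows attention to behave the same way on inputs longer than any seen when the table was fit.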
