Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech

Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, Soroosh Mariooryad, Matt Shannon, Julian Salazar, David Kao

TL;DR

This work addresses robustness and length generalization in autoregressive Transformer-based TTS by introducing Very Attentive Tacotron (VAT), which augments cross-attention with alignment-informed, interpolated relative position biases and a learned monotonic alignment layer. VAT keeps multi-head self- and cross-attention while injecting a monotonic alignment signal. As a result, it generalizes to virtually unbounded utterance lengths and eliminates common AR-TTS failures such as dropped and repeated words, with naturalness comparable to a strong T5-based baseline. The system uses VQ-VAE spectrogram discretization and a GAN-based vocoder, with per-speaker embeddings to support multi-speaker use. Across two English datasets, VAT shows robust length generalization, improved robustness as measured with ASR, and resilience to inputs containing repeated words.

Abstract

Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backpropagation and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it is still able to benefit from the flexible modeling power of interleaved multi-head self- and cross-attention operations. A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system, while eliminating problems with repeated or dropped words and enabling generalization to any practical utterance length.
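
As a rough illustration of what "providing cross-attention with relative location information" could look like, the sketch below computes one bias value per encoder position from its distance to a fractional alignment position, interpolating linearly between neighbouring entries of a bias table. Everything here is an assumption made for illustration rather than the paper's implementation: the function names, the uniform integer buckets, the clipping rule, and the use of a single table for a single head are all hypothetical.

```python
import math


def interpolated_bias(bias_table, distance):
    """Linearly interpolate a bias table at a (possibly fractional) distance.

    Assumption: bias_table[i] holds the bias for integer distance
    (i - max_dist), with max_dist = (len(bias_table) - 1) // 2.
    Distances outside [-max_dist, max_dist] are clipped to the end buckets.
    """
    max_dist = (len(bias_table) - 1) // 2
    clipped = min(max(distance, -max_dist), max_dist)
    pos = clipped + max_dist              # fractional index into the table
    lo = math.floor(pos)                  # lower neighbouring bucket
    hi = min(lo + 1, len(bias_table) - 1) # upper neighbouring bucket
    w_hi = pos - lo                       # weight given to the upper bucket
    return (1.0 - w_hi) * bias_table[lo] + w_hi * bias_table[hi]


def cross_attention_biases(bias_table, alignment_position, num_keys):
    """Bias for each encoder position j, relative to the alignment position."""
    return [
        interpolated_bias(bias_table, j - alignment_position)
        for j in range(num_keys)
    ]


# Toy table covering distances -2..2 (values are arbitrary, not learned).
table = [0.0, 1.0, 2.0, 3.0, 4.0]
biases = cross_attention_biases(table, alignment_position=1.5, num_keys=4)
```

Because the biases depend only on the distance to the alignment position and not on absolute position, a scheme of this kind would behave the same at the start of a transcript as it does thousands of tokens in, which is the intuition behind the length generalization described in the abstract.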

Paper Structure

This paper contains 60 sections, 7 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Unlike the baseline T5-based TTS system, Very Attentive Tacotron (VAT) is able to generalize to transcripts of virtually unbounded length despite only training on utterances shorter than 9.6 seconds.
  • Figure 2: High-level discrete AR Transformer TTS system overview (left), T5 baseline decoder based on Raffel et al. (2020) (center), and the VAT decoder (right). Decoder blocks are expanded in Figure 3.
  • Figure 3: Diagrams for decoder sub-blocks.
  • Figure 4: Standard RPB mapping of distances to bias matrix indices (top) and Interpolated RPB mapping of distances to bias matrix index weights (bottom).
  • Figure 5: Visualizing IRPB matrix initialization.
  • ...and 6 more figures