Zero-Shot Text-to-Speech from Continuous Text Streams

Trung Dang, David Aponte, Dung Tran, Tianyi Chen, Kazuhito Koishida

TL;DR

LiveSpeech 2 is a stream-aware model that supports infinitely long speech generation, text-audio stream synchronization, and seamless transitions between short speech chunks. It is competitive with state-of-the-art language model-based zero-shot TTS models while remaining flexible enough to support a wide range of streaming scenarios.

Abstract

Existing zero-shot text-to-speech (TTS) systems are typically designed to process complete sentences and are constrained by the maximum duration for which they have been trained. However, in many streaming applications, texts arrive continuously in short chunks, necessitating instant responses from the system. We identify the essential capabilities required for chunk-level streaming and introduce LiveSpeech 2, a stream-aware model that supports infinitely long speech generation, text-audio stream synchronization, and seamless transitions between short speech chunks. To achieve these capabilities, we propose (1) adopting Mamba, a class of sequence models distinguished by linear-time decoding, which is augmented by cross-attention mechanisms for conditioning, (2) utilizing rotary positional embeddings in the computation of cross-attention, enabling the model to process an infinite text stream by sliding a window, and (3) decoding with semantic guidance, a technique that aligns speech with the transcript during inference with minimal overhead. Experimental results demonstrate that our models are competitive with state-of-the-art language model-based zero-shot TTS models, while also providing flexibility to support a wide range of streaming scenarios.
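
One reason rotary positional embeddings suit a sliding window is that the attention score between a rotated query and a rotated key depends only on the difference between their positions, not on their absolute values. The sketch below illustrates that property in plain Python. It is not the paper's implementation: the function names `rotate` and `attention_score`, the frequency base of 10000, and the toy vectors are all assumptions chosen for illustration.

```python
import math
from typing import List


def rotate(vec: List[float], position: float, base: float = 10000.0) -> List[float]:
    """Apply a rotary embedding to an even-length vector (hypothetical helper).

    Each consecutive pair of dimensions is rotated by an angle proportional
    to the position, with a frequency that decreases for later pairs.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    rotated = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        rotated.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return rotated


def attention_score(query: List[float], q_pos: float,
                    key: List[float], k_pos: float) -> float:
    """Unnormalized dot-product score between a rotated query and key."""
    q = rotate(query, q_pos)
    k = rotate(key, k_pos)
    return sum(a * b for a, b in zip(q, k))


# Toy vectors (assumed values): the score depends only on q_pos - k_pos,
# so shifting both positions by the same amount leaves it unchanged.
query = [0.3, -1.2, 0.7, 0.5]
key = [1.1, 0.4, -0.6, 0.9]
early = attention_score(query, 5, key, 3)
late = attention_score(query, 1005, key, 1003)
```

Because `early` and `late` agree up to floating-point error, a window of text can slide arbitrarily far along the stream without the scores drifting, provided the offset between the audio frame and each word is preserved.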

Paper Structure

This paper contains 25 sections, 1 equation, 4 figures, 14 tables, 3 algorithms.

Figures (4)

  • Figure 1: LiveSpeech 2 general architecture. An upstream model generates text continuously in small chunks, while our model synthesizes speech, aiming to keep pace with the most recent chunk. Besides enrollment speech embeddings, each decoding step has access to a section of the text stream, including some past and future chunks.
  • Figure 2: Each word token in chunk $i$ is assigned a position index: the first word takes the index $\tau_i$ of the frame at which the chunk arrives, and subsequent words take incremental indices $\tau_i + 1, \tau_i + 2, \dots$
  • Figure 3: Cross-attention visualization for "There is even / a white row of / beehives in the / orchard under the walnut / trees". There are 12 rows, one per Mamba layer. Each of the first 6 rows has a single head, and each of the last 6 rows has four heads, which predict 4, 4, 4, and 5 codes per frame, respectively (1 grapheme token + 16 acoustic codes in total). In each plot, the x-axis represents the 318 generated audio frames and the y-axis represents the 21 word tokens in the transcript.
  • Figure 4: Cross-attention visualization for "When you go / out of / the house into the / flower garden, there you / feel again the / order and fine / arrangement manifest all over / the great / farm, in the fencing / and hedging, / in the windbreaks and / sheds, in the symmetrical / pasture ponds / planted with scrub / willows to give shade / to the cattle / in fly-time."
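
The position-index rule in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `assign_position_indices`, the whitespace-level word tokens, and the arrival frames used in the example are all hypothetical. How the model handles a chunk whose indices run past the arrival frame of the next chunk is not covered by the caption and is left open here.

```python
from typing import List, Tuple


def assign_position_indices(
    chunks: List[List[str]], arrival_frames: List[int]
) -> List[Tuple[str, int]]:
    """Pair each word token with a position index (hypothetical helper).

    The first word of chunk i receives tau_i, the audio frame at which the
    chunk arrived; later words in the same chunk receive tau_i + 1, tau_i + 2, ...
    """
    if len(chunks) != len(arrival_frames):
        raise ValueError("need exactly one arrival frame per chunk")
    indexed = []
    for words, tau in zip(chunks, arrival_frames):
        for offset, word in enumerate(words):
            indexed.append((word, tau + offset))
    return indexed


# First two chunks of the Figure 3 sentence, with assumed arrival frames 0 and 40.
example = assign_position_indices(
    [["There", "is", "even"], ["a", "white", "row", "of"]],
    [0, 40],
)
```

Here `example` pairs "There", "is", "even" with indices 0, 1, 2 and "a", "white", "row", "of" with indices 40, 41, 42, 43, so word positions live on the same axis as audio frames.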