LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes

Trung Dang, David Aponte, Dung Tran, Kazuhito Koishida

TL;DR

LiveSpeech targets low-latency zero-shot TTS by modeling audio as discrete RVQ tokens with a fully autoregressive transformer conditioned on text and enrollment speech. It introduces adaptive codebook weights to reallocate capacity across $Q$ codes per frame and parallel codebook group heads to decode $G$ groups in parallel, enabling up to $Q=16$ codes per frame without incurring extra latency. Trained on LibriLight with LibriTTS evaluation, it achieves competitive CER/WER/PER, SS, and O-MOS while maintaining latency around 200 ms and real-time factor comparable to state-of-the-art baselines, making it suitable for streaming applications. These contributions demonstrate effective streaming zero-shot TTS using autoregressive discrete-token generation with maintained audio fidelity.

Abstract

Prior works have demonstrated zero-shot text-to-speech by using a generative language model on audio tokens obtained via a neural audio codec. It is still challenging, however, to adapt them to low-latency scenarios. In this paper, we present LiveSpeech, a fully autoregressive language model-based approach for zero-shot text-to-speech that enables low-latency streaming of the output audio. To allow multiple tokens to be predicted within a single decoding step, we propose (1) using adaptive codebook loss weights that consider each codebook's contribution in each frame and focus on hard instances, and (2) grouping codebooks and processing the groups in parallel. Experiments show that our proposed models achieve results competitive with state-of-the-art baselines in terms of content accuracy, speaker similarity, audio quality, and inference speed while being suitable for low-latency streaming applications.

Paper Structure (18 sections, 1 equation, 3 figures, 4 tables)

Figures (3)

  • Figure 1: (Left) Our proposed architecture. Our model consists of a neural audio codec to convert between waveforms and discrete codes, a speech encoder to infer enrollment embeddings, and a transformer decoder to generate discrete tokens from conditions. (Right) Transformer decoder with parallel codebook group heads.
  • Figure 2: Transcript error rates (CER, WER) and speaker similarity scores (SS$_{e}$) of the reference audio decoded with different numbers of codebooks. 'ref' represents the original audio. The metrics are detailed in Section \ref{sec:metrics}.
  • Figure 3: Different decoding patterns for generating RVQ codes with $Q=4$ codebooks: VALL-E \cite{vall-e}, Flatten \cite{audiolm}, and Delayed \cite{musicgen}. Both VALL-E and Flatten require autoregressive decoding in both the depth and width dimensions; VALL-E uses a non-autoregressive transformer from the second codebook onward. The Delayed pattern only needs autoregressive decoding in one dimension. Moreover, all codes in each autoregressive step can be predicted in a single transformer query.
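
The Delayed pattern in the Figure 3 caption can be illustrated with a minimal sketch. It assumes the common form of the pattern in which codebook $q$ is shifted right by $q$ steps; the caption does not state the offsets, so that shift is an assumption. The names `PAD`, `apply_delay_pattern`, and `undo_delay_pattern` are hypothetical and are not taken from the paper.

```python
from typing import List

PAD = -1  # hypothetical placeholder for grid positions that hold no code


def apply_delay_pattern(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift codebook q right by q steps (assumed offset of one step per codebook).

    codes has shape (Q, T): Q codebooks, T frames.
    The result has shape (Q, T + Q - 1); each column is one autoregressive step.
    """
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    num_steps = num_frames + num_codebooks - 1
    delayed = [[pad] * num_steps for _ in range(num_codebooks)]
    for q, row in enumerate(codes):
        delayed[q][q:q + num_frames] = row
    return delayed


def undo_delay_pattern(delayed: List[List[int]]) -> List[List[int]]:
    """Invert apply_delay_pattern, recovering the (Q, T) code grid."""
    num_codebooks = len(delayed)
    num_frames = len(delayed[0]) - num_codebooks + 1
    return [delayed[q][q:q + num_frames] for q in range(num_codebooks)]


# Toy grid with Q=4 codebooks and T=3 frames; values are arbitrary code indices.
codes = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
delayed = apply_delay_pattern(codes)
for row in delayed:
    print(row)
```

Each column of `delayed` holds the codes emitted in one autoregressive step, so this toy grid takes $T + Q - 1 = 6$ steps. A fully flattened sequence that emits one code per step would take $Q \cdot T = 12$ steps for the same grid.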