VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency

Nikita Torgashov, Gustav Eje Henter, Gabriel Skantze

TL;DR

VoXtream tackles ultra-low-latency streaming TTS in a zero-shot setting with a three-part autoregressive architecture that encodes text incrementally and emits speech with minimal delay. The Phoneme Transformer, Temporal Transformer, and Depth Transformer work together with a monotonic alignment and the Mimi codec to produce audio tokens in real time, achieving a first-packet latency as low as 102 ms on GPU. Trained on 9k hours of data, VoXtream matches or exceeds larger baselines on several metrics while supporting full-stream operation with only minor quality degradation. The work shows that strong streaming TTS performance is achievable with a mid-scale training corpus, enabling practical real-time voice synthesis for interactive applications.

Abstract

We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech (TTS) system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment scheme and a limited look-ahead that does not delay onset. Built around an incremental phoneme transformer, a temporal transformer predicting semantic and duration tokens, and a depth transformer producing acoustic tokens, VoXtream achieves, to our knowledge, the lowest initial delay among publicly available streaming TTS: 102 ms on GPU. Despite being trained on a mid-scale 9k-hour corpus, it matches or surpasses larger baselines on several metrics, while delivering competitive quality in both output- and full-streaming settings. Demo and code are available at https://herimor.github.io/voxtream.
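
To make the idea of a monotonic alignment with a limited look-ahead more concrete, here is a toy Python sketch. It is not VoXtream's implementation: the function names `visible_window` and `monotonic_walk`, the binary stay/advance encoding of the per-frame decision, and the look-ahead size are all assumptions chosen for illustration. The sketch only shows two properties stated in the abstract: the text pointer never moves backwards, and generation can start as soon as the first phoneme arrives because the look-ahead uses only phonemes that are already available.

```python
from typing import Iterable, List, Sequence


def visible_window(arrived: Sequence[str], position: int, look_ahead: int) -> List[str]:
    """Phonemes a hypothetical model may attend to at `position`.

    Includes all phonemes up to the current one plus at most `look_ahead`
    future phonemes, but only those that have already arrived. The window
    never waits for missing phonemes, so look-ahead does not delay onset.
    """
    end = min(len(arrived), position + 1 + look_ahead)
    return list(arrived[:end])


def monotonic_walk(shifts: Iterable[int], num_phonemes: int) -> List[int]:
    """Turn per-frame shift decisions into a phoneme index per audio frame.

    Assumed encoding: 0 = stay on the current phoneme, 1 = advance by one.
    The pointer is clamped to the last phoneme and can never decrease.
    """
    position = 0
    path = []
    for shift in shifts:
        position = min(position + shift, num_phonemes - 1)
        path.append(position)
    return path


if __name__ == "__main__":
    # Only one phoneme has arrived: the window is just that phoneme.
    print(visible_window(["h"], position=0, look_ahead=2))
    # Six audio frames aligned monotonically to four phonemes.
    print(monotonic_walk([0, 0, 1, 0, 1, 1], num_phonemes=4))
```

In this toy setup, each audio frame is tied to exactly one phoneme index, and the sequence of indices is non-decreasing by construction.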

Paper Structure

This paper contains 8 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Overview of VoXtream, comprising an incremental Phoneme Transformer and Temporal and Depth Transformers.