VoXtream2: Full-stream TTS with dynamic speaking rate control

Nikita Torgashov; Gustav Eje Henter; Gabriel Skantze

Abstract

Full-stream text-to-speech (TTS) for interactive systems must start speaking with minimal delay while remaining controllable as text arrives incrementally. We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate control that can be updated mid-utterance. VoXtream2 combines a distribution matching mechanism over duration states with classifier-free guidance across conditioning signals to improve controllability and synthesis quality. Prompt-text masking enables textless audio prompting, removing the need for prompt transcription. Across standard zero-shot benchmarks and a dedicated speaking-rate test set, VoXtream2 achieves competitive objective and subjective results against public baselines despite a smaller model and less training data. In full-stream mode, it runs 4 times faster than real time with 74 ms first-packet latency on a consumer GPU.
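
As a reading aid for the speed figure above, the following minimal sketch relates "times faster than real time" to the real-time factor (RTF). It assumes the common definition RTF = synthesis time / duration of generated audio. The function names and the example timings are hypothetical and are not taken from the paper. First-packet latency is a separate measurement and is not derived from the RTF.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF under the assumed definition: synthesis time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """How many times faster than real time a given RTF corresponds to."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical timing: 10 s of audio synthesized in 2.5 s of wall-clock time.
rtf = real_time_factor(synthesis_seconds=2.5, audio_seconds=10.0)
print(f"RTF = {rtf}, speed-up = {speedup_over_real_time(rtf)}x")
```

Under this definition, a system that is 4 times faster than real time has an RTF of 0.25: each second of audio takes a quarter of a second to generate.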

Paper Structure (26 sections, 1 equation, 7 figures, 7 tables)

Introduction
Related Work
Speaking Rate Control
Full-Stream TTS
Classifier Free Guidance
Distribution matching
Method
Model architecture
Prompt Text Masking
Classifier Free Guidance
Acoustic Prompt Enhancement
Speaking Rate Control
Experiment Setup
Datasets
Model
...and 11 more sections

Figures (7)

Figure 1: Overview of VoXtream2 architecture.
Figure 2: Speaking rate control mechanism.
Figure 3: Evaluation of text chunk size in a full-stream.
Figure 4: Comparison of different TTS models across various speaking rates for utterance-level control.
Figure 6: Correlation between target and synthesized speaking rates across different TTS models.
...and 2 more figures
