SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer

Zhengyan Sheng, Zhihao Du, Shiliang Zhang, Zhijie Yan, Yexin Yang, Zhenhua Ling

TL;DR

SyncSpeech presents a dual-stream TTS system built on a Temporal Masked Transformer that streams incoming text while generating speech in real time. By predicting token-level durations and performing temporally ordered masked predictions, it achieves low latency and high efficiency, enabling seamless integration with upstream LLMs. Across English and Mandarin, it achieves latency reductions and throughput gains while maintaining competitive speech quality and robustness, rivaling autoregressive baselines under the same data scale. The approach combines a two-stage training regimen with masking-based pretraining and a chunk-aware speech decoder to support streaming, with demonstrated potential as a foundational component for speech-enabled SLLMs.

Abstract

This paper presents a dual-stream text-to-speech (TTS) model, SyncSpeech, capable of receiving streaming text input from upstream models while simultaneously generating streaming speech, facilitating seamless interaction with large language models. SyncSpeech has two advantages: (1) low latency, as it begins generating streaming speech upon receiving the second text token; and (2) high efficiency, as it decodes all speech tokens corresponding to each newly arrived text token in one step. To achieve this, we propose a temporal masked transformer as the backbone of SyncSpeech, combined with token-level duration prediction, so that each step predicts the speech tokens for the current text token and the duration for the next step. Additionally, we design a two-stage training strategy to improve training efficiency and the quality of generated speech. We evaluate SyncSpeech on both English and Mandarin datasets. Compared to recent dual-stream TTS models, SyncSpeech significantly reduces the first-packet delay of speech tokens and improves the real-time factor. Moreover, with the same data scale, SyncSpeech achieves performance comparable to that of traditional autoregressive TTS models in terms of both speech quality and robustness. Speech samples are available at https://SyncSpeech.github.io/.

Paper Structure

This paper contains 36 sections, 10 equations, 2 figures, 5 tables, 1 algorithm.

Figures (2)

  • Figure 1: An overview of the proposed SyncSpeech, comprising a text tokenizer, a speech tokenizer, a temporal masked generative transformer and a chunk-aware speech decoder. The figure shows that, with the random number $n=2$ and text look-ahead value $q=1$, it estimates all speech tokens (from $s_8$ to $s_{12}$) corresponding to the text token $y_2$ and the duration ($l_3$) of the next text token $y_3$ in one decoding step.
  • Figure 2: Illustrations of the inference process in two scenarios. The upper part represents the scenario without speech prompts to control prosody: in the first step, the duration of the first character is predicted separately; in each subsequent decoding step, the speech tokens of the current text token and the duration of the next text token are predicted simultaneously. The lower part illustrates the use of speech prompts to control prosody, where $y^p$ and $s^p$ denote the text tokens and speech tokens of the speech prompt, respectively.
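
The decoding schedule described in the abstract and in the upper part of Figure 2 can be sketched as a small Python generator. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `dual_stream_decode`, `first_duration`, and `step` are hypothetical, the model is replaced by toy stand-ins, the speech-prompt scenario is not modeled, and the handling of the final text tokens (decoding them without look-ahead once the text stream ends) is an assumption.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical model interfaces (assumptions, not taken from the paper's code):
#   FirstDuration(text_so_far) -> duration of the first text token
#   Step(text_so_far, speech_so_far, cur, duration)
#       -> (speech tokens for text token `cur`, duration of text token cur + 1)
FirstDuration = Callable[[List[int]], int]
Step = Callable[[List[int], List[int], int, int], Tuple[List[int], int]]


def dual_stream_decode(
    text_stream: Iterable[int],
    first_duration: FirstDuration,
    step: Step,
    lookahead: int = 1,
) -> Iterator[List[int]]:
    """Yield one chunk of speech tokens per text token while text still arrives."""
    text: List[int] = []    # text tokens received so far
    speech: List[int] = []  # speech tokens generated so far
    cur = 0                 # index of the text token to be voiced next
    duration = -1           # duration of text token `cur`; unknown before step one

    def decode_one() -> List[int]:
        nonlocal cur, duration
        if cur == 0:
            # First step: the first token's duration is predicted separately.
            duration = first_duration(text)
        # One step: all speech tokens of token `cur` plus the next duration.
        chunk, duration = step(text, speech, cur, duration)
        speech.extend(chunk)
        cur += 1
        return chunk

    for token in text_stream:
        text.append(token)
        # Decode token `cur` once `lookahead` later text tokens have arrived.
        if cur + lookahead < len(text):
            yield decode_one()

    # Assumption: after the text stream ends, flush the remaining tokens.
    while cur < len(text):
        yield decode_one()


def toy_first_duration(text: List[int]) -> int:
    """Toy stand-in for the duration predictor; always returns 3."""
    return 3


def toy_step(
    text: List[int], speech: List[int], cur: int, duration: int
) -> Tuple[List[int], int]:
    """Toy stand-in for the transformer: fabricates `duration` token ids."""
    chunk = [100 * cur + k for k in range(duration)]
    next_duration = 2 + (cur + 1) % 3
    return chunk, next_duration
```

With `lookahead=1`, the generator emits its first chunk as soon as the second text token has arrived, which mirrors the low-latency behaviour claimed in the abstract. Each later arrival triggers exactly one decoding step that produces a whole chunk rather than a single speech token.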