Continuous Speech Tokenizer in Text To Speech

Yixing Li; Ruobing Xie; Xingwu Sun; Yu Cheng; Zhanhui Kang

TL;DR

The paper addresses information loss in discrete speech tokenizers used for TTS by introducing Cont-SPT, a continuous speech tokenizer whose continuous embeddings $\mathcal{A}$ feed an autoregressive LM. To synthesize the waveform, it combines a flow-matching decoder based on OT-CFM with a denoising module. Training follows a two-stage regime that optimizes both token reconstruction ($L_{spt}$) and language modeling ($L_{LM}$), with the tokenizer learning rate scaled as $\mathrm{LR}_{\text{tokenizer}} = 0.05 \times \mathrm{LR}_{\text{LM}}$. On LibriSpeech, Cont-SPT improves WER, SIM, and MOS-related metrics, retains more high-frequency information, and is more robust to variations in sampling rate and window length than discrete baselines. The approach highlights the benefits of continuous speech representations for robust, high-fidelity TTS and lays groundwork for integrating continuous tokens into broader speech-language modeling pipelines, with future work extending to multimodal large language models.
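The learning-rate relationship above can be illustrated with a minimal sketch. Only the 0.05 factor comes from the text; the helper name `build_param_groups`, the placeholder parameter names, and the base learning rate are assumptions made for illustration and are not taken from the Cont-SPT code. The returned dictionaries follow the per-group `params`/`lr` convention that optimizers such as those in PyTorch accept.

```python
# Sketch only: the helper, parameter names, and base LR are hypothetical.
TOKENIZER_LR_FACTOR = 0.05  # LR_tokenizer = 0.05 * LR_LM, as stated in the TL;DR


def build_param_groups(lm_params, tokenizer_params, lm_lr):
    """Return optimizer-style parameter groups with a scaled tokenizer LR."""
    return [
        {"name": "lm", "params": list(lm_params), "lr": lm_lr},
        {
            "name": "tokenizer",
            "params": list(tokenizer_params),
            "lr": TOKENIZER_LR_FACTOR * lm_lr,
        },
    ]


# Placeholder strings stand in for real parameter tensors.
groups = build_param_groups(["lm.weight"], ["tokenizer.weight"], lm_lr=1e-4)
for group in groups:
    print(group["name"], group["lr"])
```

Keeping the factor in a single constant means the tokenizer rate tracks any change to the LM rate, which is presumably the intent of expressing it as a ratio.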

Abstract

The fusion of speech and language in the era of large language models has garnered significant attention. Discrete speech tokens are often used in text-to-speech tasks for speech compression and portability; they are convenient for joint training with text and offer good compression efficiency. However, we found that discrete speech tokenizers still suffer from information loss. Therefore, we propose a simple yet effective continuous speech tokenizer named Cont-SPT, and a text-to-speech model based on continuous speech tokens. Our results show that the speech language model based on the continuous speech tokenizer has better continuity and higher estimated Mean Opinion Scores (MOS). This enhancement is attributed to the continuous speech tokenizer's better information preservation rate across both low and high frequencies in the frequency domain. The code and resources for Cont-SPT can be found at https://github.com/Yixing-Li/Continuous-Speech-Tokenizer
