TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch

Xingchen Song, Mengtao Xing, Changwei Ma, Shengqiang Li, Di Wu, Binbin Zhang, Fuping Pan, Dinghao Zhou, Yuekai Zhang, Shun Lei, Zhendong Peng, Zhiyong Wu

TL;DR

TouchTTS addresses the data-scaling and deployment-efficiency challenges of LLM-based TTS by removing several preprocessing stages and replacing the flow model's backbone with a pure transformer. Its simplified data pipeline, enabled by S3Tokenizer, retains over 50% of the raw data and makes training on approximately 1M hours of data feasible. Its WeNet-style chunk-based architecture supports both streaming and non-streaming inference in a single model. The paper also investigates unifying TTS and ASR by sharing the same LLM and data, and finds that continuous features outperform discrete tokens for ASR in this setup. Experiments on Seed-Eval show competitive PER and SIM at practical latency, suggesting a scalable, deployment-friendly path toward cross-task speech foundation models.

Abstract

It is well known that LLM-based systems are data-hungry. Recent LLM-based TTS works typically employ complex data processing pipelines to obtain high-quality training data. These sophisticated pipelines require excellent models at each stage (e.g., speech denoising, speech enhancement, speaker diarization, and punctuation models), which themselves demand high-quality training data and are rarely open-sourced. Even with state-of-the-art models, issues persist, such as incomplete background noise removal and misalignment between punctuation and actual speech pauses. Moreover, the stringent filtering strategies often retain only 10-30% of the original data, significantly impeding data scaling efforts. In this work, we leverage a noise-robust audio tokenizer (S3Tokenizer) to design a simplified yet effective TTS data processing pipeline that maintains data quality while substantially reducing data acquisition costs, achieving a data retention rate of over 50%. Beyond data scaling challenges, LLM-based TTS systems also incur higher deployment costs compared to conventional approaches. Current systems typically use LLMs solely for text-to-token generation, while requiring separate models (e.g., flow matching models) for token-to-waveform generation, which cannot be directly executed by LLM inference engines, further complicating deployment. To address these challenges, we eliminate redundant modules in both LLM and flow components, replacing the flow model backbone with an LLM architecture. Building upon this simplified flow backbone, we propose a unified architecture for both streaming and non-streaming inference, significantly reducing deployment costs. Finally, we explore the feasibility of unifying TTS and ASR tasks using the same data for training, thanks to the simplified pipeline and the S3Tokenizer that reduces the quality requirements for TTS training data.

Paper Structure

This paper contains 15 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Overview of the Simplified Data Processing Pipeline. The blue blocks represent the common data processing pipeline of current LLM-based TTS systems [du2024cosyvoice, liao2024fish, guo2024fireredtts, chen2024takin, ma2024wenetspeech4tts, he2024emilia, yu2024autoprep], and the red crosses mark the parts removed from our pipeline. The green blocks represent the parts added to our pipeline.
  • Figure 2: Overview of the Simplified TTS Architecture. The blue blocks represent the common LLM-based TTS architecture, and the red crosses mark the parts removed from our architecture. The green blocks and green lines represent the parts that we added or simplified. "S" denotes the start of the sentence and "E" denotes the end of the sentence. Note that in streaming mode the chunk size can be arbitrary; here we assume it is 2 for convenience.
  • Figure 3: Illustration of the training and inference pipeline for unified streaming and non-streaming synthesis. As shown in the figure, during training (left), after selecting dynamic chunk lengths, Case 1 represents chunks without overlap, while Case 2 represents chunks with one chunk length of historical context overlap. The receptive field of the current chunk is masked, with chunk length and historical context length varying dynamically during training. During inference (right), the current chunk's receptive field is set to full coverage, and shorter token overlaps are used between chunks to smooth mel-spectrogram boundaries.
  • Figure 4: Overview of Unified TTS and ASR. In the unified architecture, ASR and TTS share the same LLM but process their own inputs and outputs separately. The input of TTS is the speaker embedding (blue hollow square) and the text tokens (yellow hollow squares), and the output is the audio tokens (purple hollow squares) encoded by S3Tokenizer. The input of ASR is the last-layer hidden output of S3Tokenizer transformed by a projector [ma2024embarrassingly] (purple solid squares), and the output is the text tokens.
  • Figure 5: Comparison between the single TTS model and the unified TTS/ASR model. For the TTS task, we calculate PER on test-zh; for the ASR task, we calculate CER on SpeechIO. For simplicity, we only report the error-rate trend for the first 60k steps (approximately 0.4M hours of training data), as these trends are sufficient to demonstrate the advantages and disadvantages of the different methods.
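
The chunk masking described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation: it follows the general WeNet-style chunk-mask idea, and the names `chunk_attention_mask`, `sample_chunk_config`, `max_chunk_size`, and `max_left_chunks` are hypothetical, as are the sampling ranges. Each position may attend to every position in its own chunk plus a limited number of preceding chunks. Setting `num_left_chunks=0` corresponds to Case 1 (no overlap) and `num_left_chunks=1` to Case 2 (one chunk of historical context).

```python
import random


def chunk_attention_mask(seq_len, chunk_size, num_left_chunks=-1):
    """Return a seq_len x seq_len boolean mask (True = may attend).

    Position i sees its whole chunk plus `num_left_chunks` earlier chunks.
    A negative `num_left_chunks` means unlimited left context.
    """
    mask = []
    for i in range(seq_len):
        chunk_idx = i // chunk_size
        # Right edge: end of the current chunk (no look-ahead past it).
        end = min((chunk_idx + 1) * chunk_size, seq_len)
        # Left edge: start of the oldest chunk kept as history.
        if num_left_chunks < 0:
            start = 0
        else:
            start = max(0, (chunk_idx - num_left_chunks) * chunk_size)
        mask.append([start <= j < end for j in range(seq_len)])
    return mask


def sample_chunk_config(rng, max_chunk_size=8, max_left_chunks=1):
    """Hypothetical per-batch sampling of chunk length and history length."""
    chunk_size = rng.randint(1, max_chunk_size)
    num_left_chunks = rng.randint(0, max_left_chunks)
    return chunk_size, num_left_chunks


# Chunk size 2, as assumed in Figure 2 for convenience.
case1 = chunk_attention_mask(6, 2, num_left_chunks=0)  # no history
case2 = chunk_attention_mask(6, 2, num_left_chunks=1)  # one chunk of history

# During training, a fresh configuration would be drawn for each batch.
rng = random.Random(0)
chunk_size, num_left_chunks = sample_chunk_config(rng)
train_mask = chunk_attention_mask(6, chunk_size, num_left_chunks)
```

Passing a chunk size at least as large as the sequence length yields an all-True mask, which is one way to view non-streaming inference as a special case of the same model.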