CTC-TTS: LLM-based dual-streaming text-to-speech with CTC alignment

Hanwen Liu, Saierdaer Yusuyin, Hao Huang, Zhijian Ou

TL;DR

CTC-TTS tackles the latency challenge of LLM-based TTS by replacing heavy MFA-based alignment with a CTC-based phoneme–speech alignment and introducing bi-word interleaving blocks that map local phoneme groups to speech tokens. It provides two practical variants, CTC-TTS-L and CTC-TTS-F, to balance synthesis quality and first-packet latency through either sequence-length concatenation or feature-dimension stacking. Across single-speaker streaming and multi-speaker zero-shot tasks, CTC-TTS outperforms fixed-ratio interleaving and MFA-based baselines on intelligibility and latency, while maintaining naturalness. These results demonstrate improved streaming efficiency and cross-speaker generalization, with plans to replace WFST-based G2P and to leverage neural alignment advances; code will be released after acceptance.

Abstract

Large-language-model (LLM)-based text-to-speech (TTS) systems can generate natural speech, but most are not designed for low-latency dual-streaming synthesis. High-quality dual-streaming TTS depends on accurate text–speech alignment and well-designed training sequences that balance synthesis quality and latency. Prior work often relies on GMM-HMM-based forced-alignment toolkits (e.g., the Montreal Forced Aligner, MFA), which are pipeline-heavy and less flexible than neural aligners; in addition, fixed-ratio interleaving of text and speech tokens struggles to capture text–speech alignment regularities. We propose CTC-TTS, which replaces MFA with a CTC-based aligner and introduces a bi-word-based interleaving strategy. Two variants are designed: CTC-TTS-L (token concatenation along the sequence length) for higher quality and CTC-TTS-F (embedding stacking along the feature dimension) for lower latency. Experiments show that CTC-TTS outperforms fixed-ratio interleaving and MFA-based baselines on streaming synthesis and zero-shot tasks. Speech samples are available at https://ctctts.github.io/.
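
To make the difference between the two variants concrete, the following is a minimal sketch of how one aligned block could be laid out in each scheme. It is an illustration under stated assumptions, not the authors' implementation: the `Block` type, the `PAD` symbol, the function names, and the toy tokens are all hypothetical, and the position-wise pairing in `interleave_feature` only stands in for embedding stacking along the feature dimension.

```python
from typing import List, Tuple

PAD = "<pad>"  # hypothetical padding symbol, not taken from the paper

# One block: (phoneme tokens, speech tokens aligned to those phonemes).
Block = Tuple[List[str], List[str]]


def interleave_length(blocks: List[Block]) -> List[str]:
    """L-style layout: phonemes then speech tokens, concatenated along the sequence axis."""
    seq: List[str] = []
    for phonemes, speech in blocks:
        seq.extend(phonemes)
        seq.extend(speech)
    return seq


def interleave_feature(blocks: List[Block]) -> List[Tuple[str, str]]:
    """F-style layout: one (phoneme, speech) pair per step, the shorter stream padded."""
    steps: List[Tuple[str, str]] = []
    for phonemes, speech in blocks:
        width = max(len(phonemes), len(speech))
        padded_p = phonemes + [PAD] * (width - len(phonemes))
        padded_s = speech + [PAD] * (width - len(speech))
        steps.extend(zip(padded_p, padded_s))
    return steps


# Toy input: two blocks with made-up phoneme and speech-token names.
blocks = [(["HH", "AH"], ["s1", "s2", "s3"]), (["L", "OW"], ["s4", "s5"])]
seq_l = interleave_length(blocks)
seq_f = interleave_feature(blocks)
print(len(seq_l), len(seq_f))
```

In this toy case the length-wise layout takes 9 steps while the feature-wise layout takes 5, which is one plausible reading of why the feature-stacked variant is described as the lower-latency option.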

Paper Structure (15 sections, 2 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: One bi-word block in the two text–speech interleaving schemes: (a) CTC-TTS-L and (b) CTC-TTS-F.
  • Figure 2: Overview of CTC-TTS-L. Components include: (1) a G2P model that converts text to phonemes; (2) a CTC-based ASR model for speech–phoneme alignment; (3) a decoder-only LM that models interleaved text and speech tokens; (4) a neural audio codec; and (5) an alignment-and-interleaving module implementing the paper's alignment and interleaving procedures. CTC-TTS-F shares the same components but uses feature-level stacking (Fig. 1b) and is omitted for brevity. No prompts are required in single-speaker settings.
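
As a rough illustration of what component (2) in Figure 2 contributes, the sketch below turns frame-level CTC labels into per-phoneme frame spans by greedy decoding. This is a simplification under stated assumptions: the blank index `BLANK = 0`, the function name `ctc_greedy_spans`, and the example label sequence are hypothetical, and the paper's aligner may instead use a constrained (Viterbi-style) alignment against the known phoneme sequence.

```python
from typing import List, Tuple

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_spans(frame_labels: List[int]) -> List[Tuple[int, int, int]]:
    """Return (label, start_frame, end_frame_exclusive) for each emitted token.

    A token starts when a non-blank label differs from the previous frame's
    label, and is extended while the same non-blank label repeats.
    """
    spans: List[List[int]] = []
    prev = BLANK
    for t, label in enumerate(frame_labels):
        if label != BLANK:
            if label != prev:
                spans.append([label, t, t + 1])  # new token begins at frame t
            else:
                spans[-1][2] = t + 1  # same token continues through frame t
        prev = label
    return [(s[0], s[1], s[2]) for s in spans]


# Per-frame argmax labels from a hypothetical CTC model (0 = blank).
frames = [0, 3, 3, 0, 0, 5, 5, 5, 0, 5]
spans = ctc_greedy_spans(frames)
print(spans)
```

Each span gives a frame range that could be mapped to codec tokens, assuming the codec and the ASR model share a known frame-rate relation. Note that the two occurrences of label 5 stay separate because a blank frame lies between them, which is the standard CTC collapsing rule.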