Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation

Xinhan Di; Zihao Chen; Yunming Liang; Junjie Zheng; Yihua Wang; Chaofan Ding

TL;DR

Bailing-TTS addresses the challenge of Chinese dialectal speech synthesis by introducing a foundation-model approach that combines continual semi-supervised text–speech alignment with a dialect-aware mixture-of-experts representation and hierarchical RL post-training. The system uses an autoregressive transformer backbone and a multi-stage training pipeline to produce spontaneous, expressive dialectal speech from text, achieving close-to-human naturalness across multiple dialects and exhibiting strong zero-shot and fine-tuning performance. Extensive experiments on Mandarin and various dialects demonstrate competitive WER/MOS and CMOS, along with streaming-efficient inference, signaling practical viability for real-world dialectal TTS. The work outlines applications in dialogue and culture, and highlights future directions toward multi-modal and audio–visual content generation from text and video inputs.

Abstract

Large-scale text-to-speech (TTS) models have made significant progress recently. However, they still fall short in generating Chinese dialectal speech. To address this, we propose Bailing-TTS, a family of large-scale TTS models capable of generating high-quality Chinese dialectal speech. Bailing-TTS serves as a foundation model for Chinese dialectal speech generation. First, continual semi-supervised learning is proposed to facilitate the alignment of text tokens and speech tokens. Second, Chinese dialectal representation learning is developed using a specific transformer architecture and multi-stage training processes. With this network architecture and the corresponding training strategy, Bailing-TTS is able to generate Chinese dialectal speech from text effectively and efficiently. Experiments demonstrate that Bailing-TTS generates Chinese dialectal speech that approaches human-like spontaneous representation. Readers are encouraged to listen to demos at https://c9412600.github.io/bltts_tech_report/index.html.
