SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity
Hanke Xie, Haopeng Lin, Wenxiao Cao, Dake Guo, Wenjie Tian, Jun Wu, Hanlin Wen, Ruixuan Shang, Hongmei Liu, Zhiqi Jiang, Yuepeng Jiang, Wenxi Chen, Ruiqi Yan, Jiale Qian, Yichao Yan, Shunshun Yin, Ming Tao, Xie Chen, Lei Xie, Xinsheng Wang
TL;DR
SoulX-Podcast introduces a two-stage, LLM-driven framework for realistic long-form, multi-speaker podcast synthesis with explicit paralinguistic and dialectal control. By interleaving text and speech tokens, expanding the codebook to include dialect and paralinguistic cues, and employing cross-dialectal prompting plus context regularization, the system achieves stable long-form dialogue with high speaker consistency and adaptive prosody. It reports state-of-the-art performance on both monologue TTS and multi-turn dialogue benchmarks, and supports cross-dialect voice cloning across Mandarin and several Chinese dialects. Together, these contributions offer a scalable path toward expressive podcast-style speech synthesis, with broad practical implications that also raise responsible-use considerations.
Abstract
Recent advances in text-to-speech (TTS) synthesis have significantly improved speech expressiveness and naturalness. However, most existing systems are tailored for single-speaker synthesis and fall short in generating coherent multi-speaker conversational speech. This technical report presents SoulX-Podcast, a system designed for podcast-style multi-turn, multi-speaker dialogic speech generation that also achieves state-of-the-art performance in conventional TTS tasks. To meet the higher naturalness demands of multi-turn spoken dialogue, SoulX-Podcast integrates a range of paralinguistic controls and supports Mandarin, English, and several Chinese dialects, including Sichuanese, Henanese, and Cantonese, enabling more personalized podcast-style speech generation. Experimental results demonstrate that SoulX-Podcast can continuously produce over 90 minutes of conversation with stable speaker timbre and smooth speaker transitions. Moreover, speakers exhibit contextually adaptive prosody, with rhythm and intonation changing naturally as the dialogue progresses. Across multiple evaluation metrics, SoulX-Podcast achieves state-of-the-art performance in both monologue TTS and multi-turn conversational speech synthesis.
