DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching
Hanke Xie, Dake Guo, Chengyou Wang, Yue Li, Wenjie Tian, Xinfa Zhu, Xinsheng Wang, Xiulin Li, Guanqiong Miao, Bo Liu, Lei Xie
TL;DR
DialoSpeech targets natural, interactive dual-speaker dialogue synthesis by combining an LLM-guided dual-track token generator (DiaLM) with a streaming Chunked Flow Matching acoustic model. A scalable Dual-Track Dialogue Data Pipeline builds speaker-labeled, overlap-aware training data. A two-stage generation process handles inter-speaker dynamics, turn-taking, and overlaps, and is followed by memory-efficient, chunked waveform reconstruction. In Chinese and English experiments, DialoSpeech outperforms strong baselines on subjective measures of spontaneity and coherence, stays competitive on objective metrics, and generalizes across languages even with limited English data. The work provides a practical, scalable framework for expressive dialogue speech synthesis, along with resources for future research in zero-shot, dual-speaker TTS.
Abstract
Recent advances in text-to-speech (TTS) synthesis, particularly those leveraging large language models (LLMs), have significantly improved expressiveness and naturalness. However, generating human-like, interactive dialogue speech remains challenging. Current systems are limited by the scarcity of dual-track data and by the difficulty of achieving naturalness, contextual coherence, and interactional dynamics (turn-taking, overlapping speech, and speaker consistency) in multi-turn conversations. To address these challenges, we propose DialoSpeech, a dual-track architecture that combines a large language model with Chunked Flow Matching for expressive, human-like dialogue speech synthesis. DialoSpeech generates multi-turn conversations with coherent speaker turns and natural overlaps, and it supports Chinese, English, and cross-lingual speech synthesis. We also introduce a data processing pipeline for constructing dual-track dialogue datasets, which enables scalable training and experimental validation. Experiments show that our model outperforms baselines, offering a solution for generating human-like spoken dialogues. Audio samples are available at https://tiamojames.github.io/DialoSpeech
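
To make the notion of dual-track, overlap-aware data more concrete, the following is a minimal illustrative sketch and not the authors' pipeline. It assumes each speaker's track is a list of timed segments; the `Segment` class, its fields, and `find_overlaps` are hypothetical names introduced only for this example. It shows how overlapping-speech regions could be located once two speaker-labeled tracks are available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One stretch of speech on a single speaker's track (hypothetical schema)."""
    speaker: str  # speaker label, e.g. "A" or "B"
    start: float  # start time in seconds
    end: float    # end time in seconds


def find_overlaps(track_a, track_b):
    """Return sorted (start, end) intervals during which both tracks are active."""
    overlaps = []
    for a in track_a:
        for b in track_b:
            # The intersection of two intervals is [max of starts, min of ends].
            start = max(a.start, b.start)
            end = min(a.end, b.end)
            if start < end:  # keep only non-empty intersections
                overlaps.append((start, end))
    return sorted(overlaps)


# Toy two-speaker dialogue: speaker B begins before A finishes, and A
# resumes before B finishes, giving two overlap regions.
track_a = [Segment("A", 0.0, 2.5), Segment("A", 4.0, 6.0)]
track_b = [Segment("B", 2.0, 4.5)]
overlaps = find_overlaps(track_a, track_b)
print(overlaps)
```

Keeping each speaker on a separate track, rather than flattening the dialogue into one interleaved sequence, is what lets overlapping speech be represented at all; a real pipeline would additionally need diarization and transcription to produce such segments.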
