DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching

Hanke Xie, Dake Guo, Chengyou Wang, Yue Li, Wenjie Tian, Xinfa Zhu, Xinsheng Wang, Xiulin Li, Guanqiong Miao, Bo Liu, Lei Xie

TL;DR

DialoSpeech tackles natural, interactive two-speaker dialogue synthesis by combining an LLM-guided dual-track token generator (DiaLM) with a streaming Chunked Flow Matching acoustic model. A scalable Dual-Track Dialogue Data Pipeline builds speaker-labeled, overlap-aware training data. Generation then proceeds in two stages: DiaLM predicts speech tokens for both speakers while modeling inter-speaker dynamics, turn-taking, and overlaps, and the acoustic model reconstructs the waveform chunk by chunk to keep memory use low. Experiments in Chinese and English show DialoSpeech outperforms strong baselines on subjective measures of spontaneity and coherence, with competitive objective metrics and robust cross-lingual generalization under limited English data. The work provides a practical, scalable framework for expressive dialogue speech synthesis and offers resources to advance future research in zero-shot, dual-speaker TTS.

Abstract

Recent advances in text-to-speech (TTS) synthesis, particularly those leveraging large language models (LLMs), have significantly improved expressiveness and naturalness. However, generating human-like, interactive dialogue speech remains challenging. Current systems face limitations due to the scarcity of dual-track data and difficulties in achieving naturalness, contextual coherence, and interactional dynamics, such as turn-taking, overlapping speech, and speaker consistency, in multi-turn conversations. To address these challenges, we propose DialoSpeech, a dual-track architecture combining a large language model with Chunked Flow Matching for expressive, human-like dialogue speech synthesis. DialoSpeech generates natural multi-turn conversations with coherent speaker turns and natural overlaps, supporting both Chinese and English and cross-lingual speech synthesis. We introduce a data processing pipeline to construct dual-track dialogue datasets, facilitating scalable training and experimental validation. Experiments show that our model outperforms baselines, offering a solution for generating human-like spoken dialogues. Audio samples are available at https://tiamojames.github.io/DialoSpeech

Paper Structure

This paper contains 14 sections, 3 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview of the dual-track dialogue data processing pipeline. The key stages include initial segmentation, parallel ASR and speaker diarization, word-to-speaker alignment, punctuation annotation, overlapped speech detection, and speaker separation.
  • Figure 2: Overview of the DialoSpeech model architecture. It illustrates the flow from input dialogue text, through LLM-based contextual and interactional guidance, to the dual-track prediction of speech tokens, which are independently synthesized into speech for each speaker via Flow Matching and combined with a neural vocoder.
  • Figure 3: Overview of the DiaLM training framework. Raw dual-speaker waveforms are first processed via the Dual-Track Data Pipeline to obtain semantic token sequences and speaker embeddings for each channel. Left-channel and right-channel tokens are embedded via a shared embedding layer and then passed through a causal cross-attention module to enable inter-speaker interaction. The resulting fused representation is concatenated with the textual embedding and fed into a LLaMA-based language model. The model outputs hidden states for both channels, which are projected to token logits via separate channel-specific heads. A dual-channel cross-entropy loss is applied to supervise both output streams.
  • Figure 4: The details of the fundamental chunk-wise attention mask.
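
The caption for Figure 4 names a chunk-wise attention mask but does not spell out its layout here. The sketch below shows one common block-causal form of such a mask, under the assumption that each frame may attend to every frame in its own chunk and in earlier chunks. The function name `chunk_attention_mask` and the optional `left_chunks` limit are hypothetical illustrations, not taken from the paper, and the mask actually used by DialoSpeech may differ.

```python
from typing import List, Optional


def chunk_attention_mask(
    num_frames: int,
    chunk_size: int,
    left_chunks: Optional[int] = None,
) -> List[List[bool]]:
    """Build a block-causal attention mask (illustrative assumption only).

    mask[q][k] is True when query frame q may attend to key frame k.
    A frame sees its whole chunk plus earlier chunks; if left_chunks is
    given, only that many earlier chunks stay visible.
    """
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if left_chunks is not None and left_chunks < 0:
        raise ValueError("left_chunks must be non-negative")

    mask: List[List[bool]] = []
    for q in range(num_frames):
        q_chunk = q // chunk_size  # index of the chunk holding the query
        row: List[bool] = []
        for k in range(num_frames):
            k_chunk = k // chunk_size  # index of the chunk holding the key
            visible = k_chunk <= q_chunk  # never look at future chunks
            if visible and left_chunks is not None:
                # Optionally cap how far back the query can look.
                visible = (q_chunk - k_chunk) <= left_chunks
            row.append(visible)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Four frames in chunks of two: frames 0-1 see chunk 0 only,
    # frames 2-3 see chunks 0 and 1.
    for row in chunk_attention_mask(4, 2):
        print("".join("1" if v else "0" for v in row))
```

With a mask of this shape, the frames of a chunk can be computed as soon as that chunk's tokens are available, which is what would make streaming synthesis possible. Capping `left_chunks` bounds the attention context, so memory stays roughly constant as the dialogue grows.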