MOSS-TTSD: Text to Spoken Dialogue Generation

Yuqian Zhang, Donghua Yu, Zhengyuan Lin, Botian Jiang, Mingshu Chen, Yaozhou Jiang, Yiwei Zhao, Yiyang Zhang, Yucheng Yuan, Hanfu Chen, Kexin Huang, Jun Zhan, Cheng Chang, Zhaoye Fei, Shimin Li, Xiaogui Yang, Qinyuan Cheng, Xipeng Qiu

Abstract

Spoken dialogue generation is crucial for applications like podcasts, dynamic commentary, and entertainment content, but poses significant challenges compared to single-utterance text-to-speech (TTS). Key requirements include accurate turn-taking, cross-turn acoustic consistency, and long-form stability, which current models often fail to address due to a lack of dialogue context modeling. To bridge this gap, we present MOSS-TTSD, a spoken dialogue synthesis model designed for expressive, multi-party conversational speech across multiple languages. With enhanced long-context modeling, MOSS-TTSD generates long-form spoken conversations from dialogue scripts with explicit speaker tags, supporting up to 60 minutes of single-pass synthesis, multi-party dialogue with up to 5 speakers, and zero-shot voice cloning from a short reference audio clip. The model supports various mainstream languages, including English and Chinese, and is adapted to several long-form scenarios. Additionally, to address limitations of existing evaluation methods, we propose TTSD-eval, an objective evaluation framework based on forced alignment that measures speaker attribution accuracy and speaker similarity without relying on speaker diarization tools. Both objective and subjective evaluation results show that MOSS-TTSD surpasses strong open-source and proprietary baselines in dialogue synthesis.

Paper Structure (17 sections, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Overview of the MOSS-TTSD data pipeline: raw audio is normalized and diarized, then merged into clips with varying speaker counts and annotated with quality, language, and sampling-rate metadata; noise-heavy domains may undergo additional denoising. Clips are transcribed end-to-end with explicit speaker tags, followed by heuristic and audio/text language-consistency filtering to produce the final training set.
  • Figure 2: MOSS-TTSD inference for multi-speaker voice cloning. Given a text prompt with speaker tags, the model conditions on per-speaker reference audio and continues the spoken dialogue by generating discrete speech tokens, unifying reference-conditioned cloning with continuation-based cloning.
  • Figure 3: Overview of TTSD-eval. Given the input script with explicit speaker tags and the generated audio, TTSD-eval uses forced alignment to obtain word-level timestamps and segments the audio into utterance fragments. ACC and SIM are computed from speaker-embedding similarities between each fragment and the reference voices.
  • Figure 4: Elo ratings and confidence intervals of MOSS-TTSD and other open-source models on human-perceived speaker attribution accuracy (ACC), voice similarity (SIM), rhythm, and overall quality.
  • Figure 5: Subjective preference results between MOSS-TTSD and other proprietary models. Bars report win/tie/lose rates of MOSS-TTSD against proprietary baselines in Chinese (ZH) and English (EN), where annotators select the overall preferred sample.
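The scoring step described for TTSD-eval in Figure 3 can be illustrated with a minimal sketch. It assumes that forced alignment and speaker-embedding extraction have already been run, so each utterance fragment arrives as a pair of (speaker tag from the script, embedding vector). The function names `cosine` and `acc_and_sim`, the tag strings `S1`/`S2`, and the specific definitions used here (ACC as the fraction of fragments whose closest reference voice matches the script tag, SIM as the mean cosine similarity to the tagged speaker's reference) are assumptions made for illustration; the paper's exact formulation may differ.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def acc_and_sim(
    fragments: List[Tuple[str, Sequence[float]]],
    references: Dict[str, Sequence[float]],
) -> Tuple[float, float]:
    """Return (ACC, SIM) under the assumed definitions.

    fragments:  (script speaker tag, fragment embedding) pairs, one per
                utterance fragment obtained from forced alignment.
    references: reference-voice embedding for each speaker tag.
    """
    if not fragments:
        raise ValueError("at least one fragment is required")

    correct = 0
    sim_total = 0.0
    for tagged_speaker, embedding in fragments:
        # Similarity of this fragment to every reference voice.
        sims = {spk: cosine(embedding, ref) for spk, ref in references.items()}
        # Assumed ACC rule: the closest reference voice should be the tagged speaker.
        predicted = max(sims, key=sims.get)
        correct += int(predicted == tagged_speaker)
        # Assumed SIM rule: similarity to the speaker the script assigns.
        sim_total += sims[tagged_speaker]

    n = len(fragments)
    return correct / n, sim_total / n


# Toy 2-D embeddings (real speaker embeddings have hundreds of dimensions).
references = {"S1": [1.0, 0.0], "S2": [0.0, 1.0]}
fragments = [
    ("S1", [1.0, 0.0]),  # matches S1
    ("S2", [0.0, 1.0]),  # matches S2
    ("S1", [0.0, 1.0]),  # tagged S1 but sounds like S2: an attribution error
]
acc, sim = acc_and_sim(fragments, references)
print(f"ACC={acc:.3f} SIM={sim:.3f}")
```

Because the speaker label of each fragment comes from the script through alignment rather than from a diarization system, a scheme of this kind only needs one embedding comparison per fragment against the known reference voices.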