CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching

Leying Zhang, Yao Qian, Xiaofei Wang, Manthan Thakker, Dongmei Wang, Jianwei Yu, Haibin Wu, Yuxuan Hu, Jinyu Li, Yanmin Qian, Sheng Zhao

TL;DR

CoVoMix2 is a fully non-autoregressive, flow-matching framework for zero-shot multi-talker dialogue generation that predicts mel-spectrograms directly from disentangled transcripts. It introduces transcription-level speaker disentanglement, sentence-level alignment, and prompt-level random masking to achieve precise overlap control and robust speaker identity without intermediate representations. With curriculum learning and diverse data mixing, it achieves state-of-the-art results among open-source baselines, with faster inference, improved speaker consistency, and natural overlapping speech. The approach targets practical applications such as podcast creation and video dubbing and generalizes to real-world scenarios, though data quality and potential misuse remain concerns.

Abstract

Generating natural-sounding, multi-speaker dialogue is crucial for applications such as podcast creation, virtual agents, and multimedia content generation. However, existing systems struggle to maintain speaker consistency, model overlapping speech, and synthesize coherent conversations efficiently. In this paper, we introduce CoVoMix2, a fully non-autoregressive framework for zero-shot multi-talker dialogue generation. CoVoMix2 directly predicts mel-spectrograms from multi-stream transcriptions using a flow-matching-based generative model, eliminating the reliance on intermediate token representations. To better capture realistic conversational dynamics, we propose transcription-level speaker disentanglement, sentence-level alignment, and prompt-level random masking strategies. Our approach achieves state-of-the-art performance, outperforming strong baselines like MoonCast and Sesame in speech quality, speaker consistency, and inference speed. Notably, CoVoMix2 operates without requiring transcriptions for the prompt and supports controllable dialogue generation, including overlapping speech and precise timing control, demonstrating strong generalizability to real-world speech generation scenarios.
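To make the phrase "flow-matching-based generative model" concrete, the following is a minimal sketch of the generic conditional flow-matching recipe with a straight-line probability path, written in NumPy. It is an illustration under stated assumptions, not the CoVoMix2 implementation: the function names `cfm_training_pair` and `euler_sample`, the linear path, the Euler solver, and the toy shapes are all hypothetical, and the conditioning on transcripts and speaker prompts is omitted entirely.

```python
import numpy as np


def cfm_training_pair(x1, rng):
    """Build one flow-matching training example for a target x1.

    x1 stands in for a mel-spectrogram of shape (frames, n_mels).
    Assumption: a straight-line path from Gaussian noise x0 to x1.
    Returns the noisy point x_t, the time t, and the regression target.
    """
    x0 = rng.standard_normal(x1.shape)  # noise sample at t = 0
    t = rng.uniform()                   # time drawn from [0, 1)
    x_t = (1.0 - t) * x0 + t * x1       # point on the straight path
    v_target = x1 - x0                  # velocity is constant along this path
    return x_t, t, v_target


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1.

    Every frame is updated in parallel at each step, which is what
    makes this style of generation non-autoregressive.
    """
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = np.ones((4, 3))  # toy "spectrogram": 4 frames, 3 mel bins

    # A trained network would be fit to predict v_target from (x_t, t).
    x_t, t, v_target = cfm_training_pair(target, rng)

    # Stand-in for a trained model: the exact velocity toward a known target.
    oracle = lambda x, t: (target - x) / (1.0 - t)
    sample = euler_sample(oracle, rng.standard_normal(target.shape), num_steps=8)
    print(np.allclose(sample, target))
```

In a real system the oracle would be replaced by a neural network trained with a mean-squared error against `v_target`, and the number of solver steps would trade quality against inference speed.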

Paper Structure

This paper contains 32 sections, 2 equations, 6 figures, and 9 tables.

Figures (6)

  • Figure 1: The overview of the proposed CoVoMix2 framework
  • Figure 2: Example of the input data organization
  • Figure 3: Data processing pipeline
  • Figure 4: Speaker consistency analysis across dialogue turns. (a) Average speaker similarity between each generated turn and its corresponding prompt. (b) Pairwise speaker similarity between turns from the same speaker within a dialogue. Consistent color indicates stable speaker timbre across turns.
  • Figure 5: Mel-spectrogram comparison between overlapping samples generated by NotebookLM, CoVoMix, and CoVoMix2, and the real sample.
  • ...and 1 more figure