CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

Leying Zhang, Yao Qian, Long Zhou, Shujie Liu, Dongmei Wang, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Lei He, Sheng Zhao, Michael Zeng

TL;DR

CoVoMix addresses zero-shot, human-like multi-speaker dialogue generation with a three-stage pipeline: a multi-stream text-to-semantic model, a conditional flow-matching acoustic model that generates mixed mel-spectrograms, and a HiFi-GAN vocoder that converts them to waveforms. The text-to-semantic model emits one semantic token stream per talker, which lets the acoustic model capture overlapping speech, turn-taking, and paralinguistic cues such as laughter in multi-round conversations. Evaluation on the Fisher dataset, using objective metrics and human judgments, shows zero-shot speaker cloning, high naturalness, coherent dialogue flow, and realistic laughter. The work advances single-channel, multi-speaker dialogue synthesis, points to extensions such as voice conversion, and discusses limitations, broader impacts, and avenues for improvement.

Abstract

Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation, along with achieving human-like naturalness in speech, continues to be a challenge. In this paper, we introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation. CoVoMix first converts dialogue text into multiple streams of discrete tokens, with each token stream representing semantic information for individual talkers. These token streams are then fed into a flow-matching based acoustic model to generate mixed mel-spectrograms. Finally, the speech waveforms are produced using a HiFi-GAN model. Furthermore, we devise a comprehensive set of metrics for measuring the effectiveness of dialogue modeling and generation. Our experimental results show that CoVoMix can generate dialogues that are not only human-like in their naturalness and coherence but also involve multiple talkers engaging in multiple rounds of conversation. This is exemplified by instances generated in a single channel where one speaker's utterance is seamlessly mixed with another's interjections or laughter, indicating the latter's role as an attentive listener. Audio samples are available at https://aka.ms/covomix.
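
The flow-matching step in the abstract can be illustrated with a minimal NumPy sketch. It assumes the standard optimal-transport conditional flow-matching formulation, in which a noise sample `x0` is linearly interpolated toward a target mel-spectrogram `x1` and the network regresses the constant velocity between them. The paper's exact objective is not shown in this excerpt, so `SIGMA_MIN`, the function names, and the 80 x 100 mel shape are all hypothetical choices made for illustration.

```python
import numpy as np

SIGMA_MIN = 1e-4  # assumed small constant; not taken from the paper


def cfm_training_pair(x1, t, rng):
    """Build one training example for conditional flow matching.

    x1  : target mel-spectrogram (any shape)
    t   : flow time in [0, 1]
    rng : numpy random Generator used to draw the noise sample
    Returns the noisy point x_t and the velocity the model should predict.
    """
    x0 = rng.standard_normal(x1.shape)                  # Gaussian noise sample
    x_t = (1.0 - (1.0 - SIGMA_MIN) * t) * x0 + t * x1   # point on the straight path
    target_velocity = x1 - (1.0 - SIGMA_MIN) * x0       # constant along the path
    return x_t, target_velocity


def euler_sample(vector_field, x0, steps=32):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * vector_field(x, i * dt)
    return x


def oracle_field(x1):
    """Closed-form conditional field toward a single known target x1.

    A trained acoustic model would replace this; it is used here only so
    the sampler can be exercised without a neural network.
    """
    def field(x, t):
        return (x1 - (1.0 - SIGMA_MIN) * x) / (1.0 - (1.0 - SIGMA_MIN) * t)
    return field


rng = np.random.default_rng(0)
mel_target = rng.standard_normal((80, 100))   # hypothetical 80-bin, 100-frame mel
noise = rng.standard_normal(mel_target.shape)
generated = euler_sample(oracle_field(mel_target), noise)
print(float(np.abs(generated - mel_target).max()))
```

In a full system, the vector field would be a neural network conditioned on the semantic token streams and a speaker prompt, and the sampled mel-spectrogram would then be passed to the vocoder.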

Paper Structure

This paper contains 29 sections, 3 equations, 12 figures, 6 tables, 1 algorithm.

Figures (12)

  • Figure 1: Overview of the CoVoMix framework, which consists of a multi-stream text-to-semantic model, a conditional flow-matching acoustic model for mixed mel-spectrogram generation, and a HiFi-GAN vocoder for waveform production.
  • Figure 2: Dialogue transcription preparation. For readability, | and an emoji stand in for the [spkchange] and [laughter] tokens.
  • Figure 3: Distribution of the durations of turn-taking events across models. The blue and green lines mark the median and mean of each event. Distributions closer to the ground truth are better.
  • Figure 4: Comparison of the number and duration of laughter events across models.
  • Figure 5: Speech consistency of CoVoSingle and CoVoMix for dialogue generation.
  • ...and 7 more figures
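
The transcription preparation named in the Figure 2 caption can be sketched roughly as follows. Only the `[spkchange]` and `[laughter]` tokens come from the caption; the `Utterance` type, the `build_dialogue_transcript` helper, and the rule of ordering utterances by start time are assumptions made for illustration, since the paper's actual procedure is not shown in this excerpt.

```python
from dataclasses import dataclass

SPK_CHANGE = "[spkchange]"  # token named in the Figure 2 caption
LAUGHTER = "[laughter]"     # token named in the Figure 2 caption


@dataclass
class Utterance:
    """Hypothetical container for one transcribed segment."""
    speaker: str
    start: float  # start time in seconds
    text: str     # may already contain LAUGHTER tokens


def build_dialogue_transcript(utterances):
    """Merge per-speaker utterances into one text stream.

    Assumption: utterances are ordered by start time, and SPK_CHANGE is
    inserted whenever consecutive utterances come from different speakers.
    """
    ordered = sorted(utterances, key=lambda u: u.start)
    pieces = []
    previous_speaker = None
    for utt in ordered:
        if previous_speaker is not None and utt.speaker != previous_speaker:
            pieces.append(SPK_CHANGE)
        pieces.append(utt.text.strip())
        previous_speaker = utt.speaker
    return " ".join(pieces)


example = [
    Utterance("B", 1.5, f"{LAUGHTER} yeah"),
    Utterance("A", 0.0, "so how was the trip"),
    Utterance("A", 2.0, "it was great"),
]
print(build_dialogue_transcript(example))
```

Keeping laughter as an inline token means a text-to-semantic model can learn where laughter occurs from the text alone, which is one plausible reading of why the caption treats it as part of the transcript.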