DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue

Xiang Li, Duyi Pan, Hongru Xiao, Jiale Han, Jing Tang, Jiabao Ma, Wei Wang, Bo Cheng

TL;DR

DialogueAgents addresses the high cost and limited expressiveness of existing multi-party dialogue data by introducing a hybrid three-agent framework (Script Writer, Speech Synthesizer, Dialogue Critic) that iteratively refines scripts and synthesized speech. The approach draws on a diverse character pool and paralinguistic tokens to enhance emotion and turn-taking, and it is used to produce the bilingual MultiTalk dataset. Empirical results indicate that iterative refinement improves naturalness, emotional expressiveness, and turn-taking, with two iterations identified as optimal before returns diminish. The work delivers a scalable data-generation pipeline and a dataset for advancing speech synthesis in complex, multi-speaker dialogue settings.

Abstract

Speech synthesis is crucial for human-computer interaction, enabling natural and intuitive communication. However, existing datasets involve high construction costs due to manual annotation and suffer from limited character diversity, contextual scenarios, and emotional expressiveness. To address these issues, we propose DialogueAgents, a novel hybrid agent-based speech synthesis framework, which integrates three specialized agents -- a script writer, a speech synthesizer, and a dialogue critic -- to collaboratively generate dialogues. Grounded in a diverse character pool, the framework iteratively refines dialogue scripts and synthesizes speech based on speech review, boosting emotional expressiveness and paralinguistic features of the synthesized dialogues. Using DialogueAgents, we contribute MultiTalk, a bilingual, multi-party, multi-turn speech dialogue dataset covering diverse topics. Extensive experiments demonstrate the effectiveness of our framework and the high quality of the MultiTalk dataset. We release the dataset and code at https://github.com/uirlx/DialogueAgents to facilitate future research on advanced speech synthesis models and customized data generation.
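
The write, synthesize, review loop described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `refine_dialogue`, `Review`, and `DialogueResult`, the three stub agents, and the `[laugh]` token are all hypothetical, and the released repository may organize the pipeline differently. The default of two refinement rounds follows the TL;DR's observation that two iterations worked best; what exactly counts as one iteration here is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Review:
    """Hypothetical critic output: a score plus free-text feedback."""
    score: float
    feedback: str


@dataclass
class DialogueResult:
    script: str
    audio: bytes
    reviews: List[Review] = field(default_factory=list)


def refine_dialogue(
    characters: List[str],
    topic: str,
    write_script: Callable[[List[str], str, Optional[str], Optional[str]], str],
    synthesize: Callable[[str], bytes],
    critique: Callable[[str, bytes], Review],
    max_iterations: int = 2,
) -> DialogueResult:
    """Draft a script, synthesize it, then revise it from critic feedback."""
    # Initial draft: no previous script and no feedback yet.
    script = write_script(characters, topic, None, None)
    audio = synthesize(script)
    reviews: List[Review] = []
    for _ in range(max_iterations):
        # The critic reviews the script together with its synthesized speech.
        review = critique(script, audio)
        reviews.append(review)
        # The writer revises using the feedback; the speech is re-synthesized.
        script = write_script(characters, topic, script, review.feedback)
        audio = synthesize(script)
    return DialogueResult(script=script, audio=audio, reviews=reviews)


# Stub agents so the sketch runs without any LLM or TTS backend (assumptions).
def stub_writer(characters, topic, previous, feedback):
    if previous is None:
        return f"{characters[0]}: Let's talk about {topic}."
    # Pretend each revision adds one paralinguistic token.
    return previous + " [laugh]"


def stub_synthesizer(script):
    # Stand-in for a TTS call: returns the script as bytes.
    return script.encode("utf-8")


def stub_critic(script, audio):
    return Review(score=float(script.count("[laugh]")),
                  feedback="add a paralinguistic cue")


result = refine_dialogue(["Alice", "Bob"], "travel",
                         stub_writer, stub_synthesizer, stub_critic)
print(result.script)
```

Passing the agents in as plain callables keeps the loop independent of any particular language model or speech synthesizer, so each stub can be swapped for a real backend without changing the control flow.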

Paper Structure

This paper contains 18 sections, 4 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: The overall framework of DialogueAgents.
  • Figure 2: Ablation of critic agent.
  • Figure 3: Distribution of the dataset by topics.