SpeechDialogueFactory: Generating High-Quality Speech Dialogue Data to Accelerate Your Speech-LLM Development

Minghan Wang, Ye Bai, Yuxia Wang, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari

TL;DR

SpeechDialogueFactory tackles the scarcity and cost of high-quality speech dialogue data for Speech-LLMs by delivering an end-to-end, production-ready pipeline that combines metadata-driven content creation, scripting, single-pass dialogue simulation, and advanced speech synthesis with voice cloning. It introduces a multi-stage content generation and rigorous quality evaluation framework, including paralinguistic annotations and a multi-faceted LLM-based quality judge, integrated with scalable UI and parallel processing. The framework yields multilingual English-Chinese datasets with extensive speaker diversity and demonstrates comparable or superior quality to human recordings at a fraction of the cost, validated through comprehensive content and speech evaluations. This work provides an open-source toolkit and sample datasets to accelerate Speech-LLM development, with practical impact in rapid dataset generation, customization, and reproducibility for researchers and industry practitioners.

Abstract

High-quality speech dialogue datasets are crucial for Speech-LLM development, yet existing acquisition methods face significant limitations. Human recordings incur high costs and privacy concerns, while synthetic approaches often lack conversational authenticity. To address these challenges, we introduce SpeechDialogueFactory, a production-ready framework for generating natural speech dialogues efficiently. Our solution employs a comprehensive pipeline including metadata generation, dialogue scripting, paralinguistic-enriched utterance simulation, and natural speech synthesis with voice cloning. Additionally, the system provides an interactive UI for detailed sample inspection and a high-throughput batch synthesis mode. Evaluations show that dialogues generated by our system achieve a quality comparable to human recordings while significantly reducing production costs. We release our work as an open-source toolkit, alongside example datasets available in English and Chinese, empowering researchers and developers in Speech-LLM research and development.

Paper Structure

This paper contains 30 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Example dialogue created by SpeechDialogueFactory with comprehensive character metadata and paralinguistic annotations.
  • Figure 2: The SpeechDialogueFactory pipeline with integrated quality control. The framework processes user-specified settings through three main stages: content generation (including metadata generation, dialogue scripting, and simulation), speech generation (speaker retrieval and TTS synthesis), and quality evaluation at both text and speech levels to filter low-quality outputs before proceeding to subsequent processing steps.
  • Figure 3: User Interface of SpeechDialogueFactory. The main interface consists of three tabs: Single Sample Generation (left), Batch Samples Generation (right), and Sample Inspection. In the Single Sample Generation tab (left), the System Output section (highlighted in green) displays the complete results, including intermediate outputs such as Metadata, Scripts, Dialogue Audio, and Quality Scores (note: screenshot content is simplified for clarity). In the Batch Samples Generation tab (right), the System Output section (highlighted in green) provides a pre-generated command line that users can copy and paste into their terminal to start batch dialogue generation. The Sample Inspection tab offers a convenient way to inspect individual dialogue samples produced by the batch method. Its display is similar to that of the Single Sample Generation tab, so it is not shown separately.
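
The staged flow described in the Figure 2 caption (content generation, a text-level quality check, speech generation, a speech-level quality check) can be illustrated with a minimal Python sketch. This is not the toolkit's actual API: `Sample`, `run_pipeline`, the stub stages, and the 0.5 score thresholds are all hypothetical names and values chosen for illustration. The only behavior taken from the caption is that low-quality outputs are filtered before later steps run.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Sample:
    """One dialogue sample moving through the pipeline (hypothetical structure)."""
    settings: dict
    metadata: dict = field(default_factory=dict)
    script: list = field(default_factory=list)
    utterances: list = field(default_factory=list)
    audio: list = field(default_factory=list)


Stage = Callable[[Sample], None]   # a stage fills in fields on the sample
Judge = Callable[[Sample], float]  # a judge returns a quality score


def run_pipeline(settings: dict, content_stage: Stage, text_judge: Judge,
                 speech_stage: Stage, speech_judge: Judge,
                 min_text: float = 0.5, min_speech: float = 0.5) -> Optional[Sample]:
    """Run the stages in order; return None as soon as a quality gate fails.

    The thresholds are assumptions, not values from the paper.
    """
    sample = Sample(settings=settings)
    content_stage(sample)                  # metadata, script, simulated dialogue
    if text_judge(sample) < min_text:      # text-level gate: skip synthesis if it fails
        return None
    speech_stage(sample)                   # speaker retrieval and TTS synthesis
    if speech_judge(sample) < min_speech:  # speech-level gate
        return None
    return sample


# Stub stages standing in for the LLM and TTS components.
def stub_content(sample: Sample) -> None:
    sample.metadata = {"topic": sample.settings.get("topic", "unspecified")}
    sample.script = ["greeting", "main exchange", "closing"]
    sample.utterances = [("A", "Hello!"), ("B", "Hi, how are you?")]


def stub_speech(sample: Sample) -> None:
    sample.audio = [f"<waveform for {speaker}>" for speaker, _ in sample.utterances]


kept = run_pipeline({"topic": "travel"}, stub_content, lambda s: 0.9,
                    stub_speech, lambda s: 0.8)
dropped = run_pipeline({"topic": "travel"}, stub_content, lambda s: 0.1,
                       stub_speech, lambda s: 0.8)
```

In this sketch, `kept` is a fully populated `Sample`, while `dropped` is `None` because the text-level score falls below the threshold, so the synthesis stage is never called for it. Gating before synthesis reflects the caption's point that filtering happens before subsequent processing steps.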