SpeechDialogueFactory: Generating High-Quality Speech Dialogue Data to Accelerate Your Speech-LLM Development
Minghan Wang, Ye Bai, Yuxia Wang, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari
TL;DR
SpeechDialogueFactory tackles the scarcity and cost of high-quality speech dialogue data for Speech-LLMs with an end-to-end, production-ready pipeline that combines metadata generation, dialogue scripting, paralinguistic-enriched utterance simulation, and speech synthesis with voice cloning. It pairs multi-stage content generation with a quality evaluation framework that includes paralinguistic annotations and a multi-faceted LLM-based quality judge, and it ships with an interactive UI for sample inspection and a high-throughput batch synthesis mode. The framework yields English and Chinese datasets with extensive speaker diversity, and content and speech evaluations show quality comparable to human recordings at significantly lower production cost. The work is released as an open-source toolkit with sample datasets, supporting rapid dataset generation, customization, and reproducibility for researchers and industry practitioners working on Speech-LLMs.
Abstract
High-quality speech dialogue datasets are crucial for Speech-LLM development, yet existing acquisition methods face significant limitations. Human recordings incur high costs and privacy concerns, while synthetic approaches often lack conversational authenticity. To address these challenges, we introduce SpeechDialogueFactory, a production-ready framework for generating natural speech dialogues efficiently. Our solution employs a comprehensive pipeline including metadata generation, dialogue scripting, paralinguistic-enriched utterance simulation, and natural speech synthesis with voice cloning. Additionally, the system provides an interactive UI for detailed sample inspection and a high-throughput batch synthesis mode. Evaluations show that dialogues generated by our system achieve quality comparable to that of human recordings while significantly reducing production costs. We release our work as an open-source toolkit, alongside example datasets available in English and Chinese, empowering researchers and developers in Speech-LLM research and development.
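To make the four pipeline stages named in the abstract concrete, the following is a minimal illustrative sketch, not the toolkit's actual API. Every name in it (Utterance, generate_metadata, write_script, simulate_utterances, synthesize, run_pipeline) is hypothetical, and each stage is a deterministic stub standing in for what would presumably be an LLM or TTS call in the real system. It only shows how the output of each stage could feed the next.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Utterance:
    speaker: str
    text: str
    emotion: str          # hypothetical paralinguistic annotation
    audio_path: str = ""  # filled in by the synthesis stage


def generate_metadata(topic: str, language: str) -> Dict:
    # Stage 1 (stub): assumed to be an LLM call in a real system.
    return {"topic": topic, "language": language, "speakers": ["A", "B"]}


def write_script(metadata: Dict, turns: int) -> List[str]:
    # Stage 2 (stub): outline one step per dialogue turn.
    return [f"turn {i + 1} about {metadata['topic']}" for i in range(turns)]


def simulate_utterances(metadata: Dict, script: List[str]) -> List[Utterance]:
    # Stage 3 (stub): alternate speakers and attach a placeholder emotion tag.
    speakers = metadata["speakers"]
    return [
        Utterance(speaker=speakers[i % len(speakers)], text=step, emotion="neutral")
        for i, step in enumerate(script)
    ]


def synthesize(utterances: List[Utterance], out_dir: str) -> List[Utterance]:
    # Stage 4 (stub): record where synthesized audio would be written.
    # No audio is produced here; voice cloning is out of scope for this sketch.
    for i, utt in enumerate(utterances):
        utt.audio_path = f"{out_dir}/{i:03d}_{utt.speaker}.wav"
    return utterances


def run_pipeline(topic: str, language: str, turns: int = 4,
                 out_dir: str = "out") -> List[Utterance]:
    # Chain the four stages in the order listed in the abstract.
    metadata = generate_metadata(topic, language)
    script = write_script(metadata, turns)
    utterances = simulate_utterances(metadata, script)
    return synthesize(utterances, out_dir)


if __name__ == "__main__":
    for utt in run_pipeline("booking a flight", "en", turns=3):
        print(utt.speaker, utt.emotion, utt.audio_path, utt.text)
```

Keeping each stage as a separate function with plain data passed between them is one way such a pipeline could support both per-sample inspection and batch runs; whether the released toolkit is organized this way is an assumption.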
