DeepDialogue: A Multi-Turn Emotionally-Rich Spoken Dialogue Dataset
Alkis Koudounas, Moreno La Quatra, Elena Baralis
TL;DR
DeepDialogue tackles the shortage of emotionally rich, multi-turn, cross-domain dialogue data by introducing a large-scale multimodal corpus. A four-stage pipeline (domain/emotion setup, LLM-based dialogue generation with emotional progression, hybrid human–LLM quality filtering, and dual speech synthesis using XTTS-v2 with RAVDESS conditioning and Orpheus) produces 40,150 high-quality dialogues across 41 domains and 20 emotions, accompanied by over 480 hours of emotion-consistent audio. Key findings show that cross-model interactions improve coherence, concrete domains yield more grounded conversations, and smaller models struggle beyond ~6 turns while larger models maintain quality longer. The dataset also enables effective speech emotion recognition and transfer to external corpora. This resource advances emotionally intelligent, multimodal dialogue research and provides a benchmark for evaluating text- and speech-based conversational systems across diverse domains, with implications for emotion-aware training and evaluation. It also highlights ethical considerations and biases, guiding future work toward broader language and cultural representation and richer emotion modeling.
Abstract
Recent advances in conversational AI have demonstrated impressive capabilities in single-turn responses, yet multi-turn dialogues remain challenging for even the most sophisticated language models. Current dialogue datasets are limited in emotional range, domain diversity, and turn depth, and are predominantly text-only, hindering progress in developing more human-like conversational systems across modalities. To address these limitations, we present DeepDialogue, a large-scale multimodal dataset containing 40,150 high-quality multi-turn dialogues spanning 41 domains and incorporating 20 distinct emotions with coherent emotional progressions. Our approach pairs 9 different language models (4B to 72B parameters) to generate 65,600 initial conversations, which we then evaluate through a combination of human annotation and LLM-based quality filtering. The resulting dataset reveals fundamental insights: smaller models fail to maintain coherence beyond 6 dialogue turns; concrete domains (e.g., "cars," "travel") yield more meaningful conversations than abstract ones (e.g., "philosophy"); and cross-model interactions produce more coherent dialogues than same-model conversations. A key contribution of DeepDialogue is its speech component, where we synthesize emotion-consistent voices for all 40,150 dialogues, creating the first large-scale open-source multimodal dialogue dataset that faithfully preserves emotional context across multi-turn conversations.
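As a quick sanity check on the figures quoted in the abstract, the short sketch below computes the share of generated conversations that survive quality filtering and counts the distinct pairings available among 9 models. The function names are illustrative, and the pairing count assumes unordered pairs with same-model pairs allowed; the abstract does not state whether every such pairing was actually used, so the count is only an upper bound on distinct unordered pairings.

```python
from itertools import combinations_with_replacement

# Figures taken directly from the abstract.
INITIAL_DIALOGUES = 65_600   # conversations generated before filtering
RETAINED_DIALOGUES = 40_150  # conversations kept after quality filtering
NUM_MODELS = 9               # language models paired for generation


def retention_rate(retained: int, initial: int) -> float:
    """Fraction of generated conversations that pass quality filtering."""
    return retained / initial


def count_model_pairings(num_models: int) -> tuple[int, int]:
    """Count (same-model, cross-model) unordered pairings.

    Assumption (not stated in the abstract): pairings are unordered and
    a model may be paired with itself.
    """
    pairs = list(combinations_with_replacement(range(num_models), 2))
    same = sum(1 for a, b in pairs if a == b)
    cross = len(pairs) - same
    return same, cross


if __name__ == "__main__":
    rate = retention_rate(RETAINED_DIALOGUES, INITIAL_DIALOGUES)
    same, cross = count_model_pairings(NUM_MODELS)
    print(f"Retention after filtering: {rate:.1%}")
    print(f"Possible pairings: {same} same-model, {cross} cross-model")
```

Under these assumptions, roughly 61% of the initial conversations are retained, and 9 models admit at most 9 same-model and 36 cross-model unordered pairings.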
