OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios

Xize Cheng, Dongjie Fu, Xiaoda Yang, Minghui Fang, Ruofan Hu, Jingyu Lu, Bai Jionghao, Zehan Wang, Shengpeng Ji, Rongjie Huang, Linjun Li, Yu Chen, Tao Jin, Zhou Zhao

TL;DR

This work tackles the scarcity and limited diversity of real-world spoken dialogue data by introducing ShareChatX, a large-scale synthetic dataset spanning emotion, audio events, and music. It presents OmniChat, a multi-turn spoken dialogue system that uses a heterogeneous Mix-Former to fuse multi-modal features from dedicated experts (content, emotion, and non-speech audio) and generate contextually appropriate responses. The paper systematically studies training strategies with synthetic data, finding an optimal balance between synthetic and real data and demonstrating state-of-the-art performance on the real DailyTalk dataset. The results highlight the critical role of synthetic data in enabling robust, emotion-aware dialogue across complex, multimodal scenarios, and the authors provide data and code for reproducibility.

Abstract

With the rapid development of large language models, researchers have created increasingly advanced spoken dialogue systems that can naturally converse with humans. However, these systems still struggle to handle the full complexity of real-world conversations, including audio events, musical contexts, and emotional expressions, mainly because current dialogue datasets are constrained in both scale and scenario diversity. In this paper, we propose leveraging synthetic data to enhance the dialogue models across diverse scenarios. We introduce ShareChatX, the first comprehensive, large-scale dataset for spoken dialogue that spans diverse scenarios. Based on this dataset, we introduce OmniChat, a multi-turn dialogue system with a heterogeneous feature fusion module, designed to optimize feature selection in different dialogue contexts. In addition, we explored critical aspects of training dialogue systems using synthetic data. Through comprehensive experimentation, we determined the ideal balance between synthetic and real data, achieving state-of-the-art results on the real-world dialogue dataset DailyTalk. We also highlight the crucial importance of synthetic data in tackling diverse, complex dialogue scenarios, especially those involving audio and music. For more details, please visit our demo page at https://sharechatx.github.io/.
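
The idea of balancing synthetic and real data can be illustrated with a minimal sketch. This is not the authors' training code: the function name `mix_datasets`, the `synthetic_fraction` parameter, and the choice to sample with replacement are all assumptions made for illustration, and the abstract does not state the actual ratio the paper arrives at.

```python
import random


def mix_datasets(real, synthetic, synthetic_fraction, total, seed=0):
    """Build a training set of `total` items with a target share of synthetic data.

    Hypothetical helper: sampling is done with replacement so that a small
    real corpus can be upsampled to meet the requested share.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must lie in [0, 1]")
    rng = random.Random(seed)
    n_synthetic = round(total * synthetic_fraction)
    n_real = total - n_synthetic
    mixed = rng.choices(synthetic, k=n_synthetic) + rng.choices(real, k=n_real)
    rng.shuffle(mixed)  # avoid blocks of same-source examples
    return mixed
```

Sweeping `synthetic_fraction` over a grid and evaluating each resulting model on held-out real dialogues is one plausible way to search for the kind of balance the abstract describes.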

Paper Structure (33 sections, 3 equations, 13 figures, 5 tables)


Figures (13)

  • Figure 1: Overview for Crafting our ShareChatX Dataset. First, text dialogue scripts $T_i=\{T_i^{style},T_i^{content}\}$ are generated using large language models, with data-specific prompts tailored for the three subsets: -emotion, -audio, and -music. Next, spoken dialogue data $S_i$ is synthesized using a controllable text-to-speech synthesis model (CosyVoice-Instruct), incorporating style parameters such as gender, pitch, speed, and emotion. To ensure the quality of the generated data, both model-based and manual verification processes are applied. Finally, audio events and music are integrated into the dialogues, with specific methods for handling temporary and continuous sounds.
  • Figure 2: Overview of OmniChat. (a) OmniChat predicts the $t$-th response $\mathbf{T}_{Assist,t}$ by using the previous $t$ dialogues $\{\mathbf{D}_{human,1},\cdots,\mathbf{D}_{human,t}\}$ and $t-1$ responses $\{\mathbf{T}_{Assist,1},\cdots,\mathbf{T}_{Assist,t-1}\}$ as context. OmniChat concurrently predicts both the Style $\mathbf{T}_{Assist,t}^{style}$ and Content $\mathbf{T}_{Assist,t}^{content}$ of the response. (b) Mix-Former leverages Q-Former to independently represent different expert features, thereby enhancing the ability to capture the nuances of each aspect of the dialogue segment.
  • Figure 3: Performance comparison of dialogue systems trained with varying data scales on ShareChatX-Emotion. T denotes text input, S+T denotes both speech and ASR-transcription input, and S (ours) represents our method utilizing only speech as input. The numbers on the horizontal axis represent the scale of the dialogue data used during training.
  • Figure 4: Performance Comparison of Various Training Strategies on ShareChatX-Audio. A-FT refers to training using only the -audio subset, E-PT involves pre-training on the more general -emotion subset, and E-PT+A-FT represents a strategy where the model is first pre-trained on the general -emotion subset, followed by fine-tuning on the -audio subset.
  • Figure 5: The Prompt Template for GPT-eval.
  • ...and 8 more figures
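
The context layout described in the Figure 2 caption ($t$ human turns interleaved with $t-1$ assistant responses) can be sketched as follows. This is only an illustration of the turn ordering: the `Turn` class, the `build_context` function, and the use of plain strings as payloads are hypothetical, and the real system operates on speech features and expert embeddings rather than strings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    role: str     # "human" or "assistant"
    payload: str  # placeholder for speech features (human) or text (assistant)


def build_context(human_turns: List[str], assistant_responses: List[str]) -> List[Turn]:
    """Interleave t human turns with t-1 assistant responses.

    The returned context ends on the t-th human turn, which is the one
    the model should answer next.
    """
    if len(human_turns) != len(assistant_responses) + 1:
        raise ValueError("expected t human turns and t-1 assistant responses")
    context: List[Turn] = []
    for i, human in enumerate(human_turns):
        context.append(Turn("human", human))
        if i < len(assistant_responses):
            context.append(Turn("assistant", assistant_responses[i]))
    return context
```

For example, three human turns and two earlier responses yield the order human, assistant, human, assistant, human, after which the model would predict the style and content of the third response.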