WavChat: A Survey of Spoken Dialogue Models

Shengpeng Ji, Yifu Chen, Minghui Fang, Jialong Zuo, Jingyu Lu, Hanting Wang, Ziyue Jiang, Long Zhou, Shujie Liu, Xize Cheng, Xiaoda Yang, Zehan Wang, Qian Yang, Jian Li, Yidi Jiang, Jingzhen He, Yunfei Chu, Jin Xu, Zhou Zhao

TL;DR

This survey maps the landscape of spoken dialogue models, contrasting cascaded and end-to-end architectures and detailing the core technologies that enable speech-aware dialogue. It synthesizes advances in speech representations, training paradigms, streaming and duplex interactions, and evaluation, while cataloging datasets and public resources. The authors identify key trade-offs between semantic and acoustic representations, discuss multi-stage training and generation strategies, and highlight open challenges such as latency, data availability, and the lack of unified benchmarks. The work aims to guide both academic research and industrial deployment of robust, interactive, multi-modal spoken dialogue systems.

Abstract

Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain. Compared to traditional three-tier cascaded spoken dialogue models that comprise speech recognition (ASR), large language models (LLMs), and text-to-speech (TTS), modern spoken dialogue models exhibit greater intelligence. These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech. Moreover, they generate high-quality, multi-turn speech responses with low latency, enabling real-time interaction through simultaneous listening and speaking. Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems and the underlying technologies. To address this, we first compile existing spoken dialogue systems in chronological order and categorize them into the cascaded and end-to-end paradigms. We then provide an in-depth overview of the core technologies in spoken dialogue models, covering speech representation, training paradigms, streaming, duplex, and interaction capabilities. Each section discusses the limitations of these technologies and outlines considerations for future research. Additionally, we present a thorough review of relevant datasets, evaluation metrics, and benchmarks from the perspectives of training and evaluating spoken dialogue systems. We hope this survey will contribute to advancing both academic research and industrial applications in the field of spoken dialogue systems. Related material is available at https://github.com/jishengpeng/WavChat.

Paper Structure

This paper contains 56 sections, 8 figures, 6 tables.

Figures (8)

  • Figure 1: A timeline of existing spoken dialogue models in recent years. The timeline was established mainly according to the release date (e.g., the submission date to arXiv) of the technical paper for each model. Certain works, such as Westlake-Omni, MooER-Omni, Hertz-dev, SpeechGPT2, and Fish-Agent, do not have corresponding published papers and are therefore not included in the figure. Publicly available model checkpoints are marked in yellow.
  • Figure 2: A general overview of current spoken dialogue systems. We categorize these systems into two paradigms, cascaded spoken dialogue models and end-to-end spoken dialogue models, based on whether the core language model can directly understand and generate speech representations. Additionally, we provide a visualization of the input and output methods used in different spoken dialogue systems.
  • Figure 3: A general overview of the structure of WavChat.
  • Figure 4: An overall demonstration of the functions of the spoken dialogue systems. We describe the ideal capabilities of such systems from nine different perspectives: Text Intelligence, Speech Intelligence, Audio and Music Generation, Audio and Music Understanding, Multilingual Capability, Context Learning, Interaction Capability, Streaming Latency, and Multimodal Capability. Each function is illustrated with corresponding dialogue examples.
  • Figure 5: Categorization Diagram of Spoken Dialogue Model Architectural Paradigms.
  • ...and 3 more figures