SoulX-Duplug: Plug-and-Play Streaming State Prediction Module for Realtime Full-Duplex Speech Conversation

Ruiqi Yan, Wenxi Chen, Zhanxun Liu, Ziyang Ma, Haopeng Lin, Hanlin Wen, Hanke Xie, Jun Wu, Yuzhe Liang, Yuxiang Zhao, Pengchao Feng, Jiale Qian, Hao Meng, Yuhang Dai, Shunshun Yin, Ming Tao, Lei Xie, Kai Yu, Xinsheng Wang, Xie Chen

Abstract

Recent advances in spoken dialogue systems have brought increased attention to human-like full-duplex voice interactions. However, our comprehensive review of this field reveals several challenges, including the difficulty in obtaining training data, catastrophic forgetting, and limited scalability. In this work, we propose SoulX-Duplug, a plug-and-play streaming state prediction module for full-duplex spoken dialogue systems. By jointly performing streaming ASR, SoulX-Duplug explicitly leverages textual information to identify user intent, effectively serving as a semantic VAD. To promote fair evaluation, we introduce SoulX-Duplug-Eval, extending widely used benchmarks with improved bilingual coverage. Experimental results show that SoulX-Duplug enables low-latency streaming dialogue state control, and the system built upon it outperforms existing full-duplex models in overall turn management and latency performance. We have open-sourced SoulX-Duplug and SoulX-Duplug-Eval.

Paper Structure

This paper contains 28 sections, 4 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Overview of state-driven modular full-duplex speech interaction: the state prediction module processes incoming audio and predicts dialogue states, while the half-duplex SDM generates or stops speech accordingly.
  • Figure 2: Overview of research on full-duplex spoken dialogue systems.
  • Figure 3: Illustration of three types of full-duplex spoken dialogue systems. (a) End-to-End Continuous-Output Full-Duplex Models. (b) End-to-End State-Driven Full-Duplex Models. (c) Modular State-Driven Full-Duplex Systems.
  • Figure 4: The architecture of SoulX-Duplug. The model runs with interleaved audio tokens, text tokens, and state tokens, ensuring that textual information is available when predicting state tokens. During training, VAD, ASR, and state prediction are end-to-end optimized. During inference, a lightweight state-of-the-art ASR model provides text guidance via teacher forcing.
  • Figure 5: A detailed example explaining the state token design of SoulX-Duplug and its streaming inference paradigm.
  • ...and 1 more figures
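
The state-driven modular design described in the Figure 1 caption can be illustrated with a minimal sketch: a state prediction module emits one dialogue state per audio chunk, and a controller starts or stops a half-duplex SDM accordingly. Everything below is an assumption made for illustration. The state labels (`SILENCE`, `USER_SPEAKING`, `USER_TURN_END`), the `HalfDuplexSDM` stand-in, and the `run_controller` function are hypothetical and are not the paper's actual state tokens or interfaces.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Iterable, List


class State(Enum):
    # Hypothetical labels for illustration; not the paper's state tokens.
    SILENCE = "silence"
    USER_SPEAKING = "user_speaking"
    USER_TURN_END = "user_turn_end"


@dataclass
class HalfDuplexSDM:
    """Stand-in for a half-duplex spoken dialogue model (assumed interface)."""
    speaking: bool = False
    log: List[str] = field(default_factory=list)

    def start_response(self) -> None:
        self.speaking = True
        self.log.append("start")

    def stop_response(self) -> None:
        # Only record a stop if the model was actually speaking.
        if self.speaking:
            self.speaking = False
            self.log.append("stop")


def run_controller(states: Iterable[State], sdm: HalfDuplexSDM) -> List[str]:
    """Drive the SDM from a stream of predicted states, one per audio chunk."""
    for state in states:
        if state is State.USER_TURN_END and not sdm.speaking:
            # The user has finished a turn: let the SDM respond.
            sdm.start_response()
        elif state is State.USER_SPEAKING and sdm.speaking:
            # The user talks over the system: treat it as a barge-in.
            sdm.stop_response()
    return sdm.log
```

In this sketch the SDM itself needs no full-duplex training, since turn-taking decisions live entirely in the controller. That separation is one way to read the plug-and-play claim, though a real system would also need to handle cases this toy loop ignores, such as backchannels that should not interrupt the response.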