DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities

Xiangyu Lu, Wang Xu, Haoyu Wang, Hongyun Zhou, Haiyan Zhao, Conghui Zhu, Tiejun Zhao, Muyun Yang

TL;DR

DuplexMamba is a real-time, end-to-end multimodal duplex system for speech-to-text conversation built on the Mamba architecture. It combines a ConMamba speech encoder, a speech adapter, and a Mamba-based language model, and adds two training objectives, input state discrimination and streaming alignment, to obtain duplex and streaming capabilities with a fixed-size model state that keeps inference memory efficient as the context grows. Four training stages (multimodal alignment, multimodal instruction tuning, input state discrimination, and streaming alignment) provide cross-modal alignment and robust real-time operation, while a duplex decoding strategy with state tokens handles new inputs and interruptions as they arrive. On ASR and VoiceBench, results are competitive with Transformer-based baselines, with better memory efficiency and effective interruption handling, which points to the practical value of fixed-state models for streaming speech interaction.

Abstract

Real-time speech conversation is essential for natural and efficient human-machine interaction, and it requires duplex and streaming capabilities. Traditional Transformer-based conversational chatbots operate in a turn-based manner and incur computational cost that grows quadratically with input length. In this paper, we propose DuplexMamba, a Mamba-based end-to-end multimodal duplex model for speech-to-text conversation. DuplexMamba processes input and generates output simultaneously, adjusting dynamically to support real-time streaming. Specifically, we develop a Mamba-based speech encoder and adapt it to a Mamba-based language model. Furthermore, we introduce a novel duplex decoding strategy that makes this simultaneous input processing and output generation possible. Experimental results demonstrate that DuplexMamba successfully implements duplex and streaming capabilities while achieving performance comparable to several recently developed Transformer-based models on automatic speech recognition (ASR) tasks and voice assistant benchmark evaluations. Our code and model are released.
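
The contrast the abstract draws between Transformer-style models and a fixed-state model can be illustrated with a small toy simulation. This is only a sketch under stated assumptions: the function `simulate`, the decay constant 0.9, and the element-wise recurrence are hypothetical and are not Mamba's selective scan or the authors' code. It counts how many values each approach must keep in memory and how many attention comparisons a cache-based model performs in total.

```python
def simulate(num_tokens, dim):
    """Toy comparison of a fixed-size recurrent state and a growing cache.

    Hypothetical illustration only; not the DuplexMamba implementation.
    Returns (recurrent_state_values, cached_values, attention_comparisons).
    """
    state = [0.0] * dim   # fixed-size state, overwritten at every step
    cache = []            # attention-style cache, one entry per token
    attention_comparisons = 0

    for t in range(num_tokens):
        x = [float(t)] * dim  # stand-in for a token embedding

        # Recurrent update: fold the new token into the same-sized state.
        state = [0.9 * h + 0.1 * xi for h, xi in zip(state, x)]

        # Cache-based update: store the token, then attend to every entry.
        cache.append(x)
        attention_comparisons += len(cache)

    return len(state), len(cache) * dim, attention_comparisons


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, simulate(n, dim=4))
```

In this toy setting the recurrent state holds the same number of values regardless of input length, the cache grows linearly, and the total number of attention comparisons grows quadratically.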

Paper Structure

This paper contains 41 sections, 3 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: The model architecture of DuplexMamba.
  • Figure 2: The four-stage training of DuplexMamba.
  • Figure 3: The duplex decoding strategy of DuplexMamba. "IC" is short for the "<incomplete>" token, "RE" for the "<response>" token, and "IG" for the "<ignore>" token. "O1" and "O2" represent the output tokens for query 1 and query 2, respectively. Due to the fixed state size in Mamba-based models, creating an auxiliary branch simply involves duplicating the model's current state.
  • Figure 4: GPU memory usage of DuplexMamba and Qwen2-Audio across different context lengths.
  • Figure 5: Cases of interruption interaction and non-awakening interaction. The model predicts the state token for each user input.
  • ...and 1 more figures
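
The Figure 3 caption describes state tokens and an auxiliary branch created by duplicating the model's current state. The control flow this suggests can be sketched as follows. Everything here is an assumption made for illustration: the class `ToyFixedStateModel`, the functions `toy_state_token` and `handle_input`, and the text rule that picks a token are hypothetical, and the meaning assigned to each token (keep listening, answer the new query, disregard the input) is inferred from the token names rather than taken from the authors' code.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

INCOMPLETE, RESPONSE, IGNORE = "<incomplete>", "<response>", "<ignore>"


@dataclass
class ToyFixedStateModel:
    """Hypothetical stand-in for a fixed-state model (not the authors' API)."""
    state: List[float] = field(default_factory=lambda: [0.0] * 4)
    last_chunk: str = ""

    def consume(self, chunk: str) -> None:
        # Fold the new input into a state whose size never changes.
        self.state = [0.5 * s + float(len(chunk)) for s in self.state]
        self.last_chunk = chunk

    def fork(self) -> "ToyFixedStateModel":
        # The state is fixed-size, so a branch is just a copy of it.
        return copy.deepcopy(self)


def toy_state_token(model: ToyFixedStateModel) -> str:
    """Made-up rule standing in for the model's state-token prediction."""
    if model.last_chunk.endswith("..."):
        return INCOMPLETE
    if model.last_chunk.lower().startswith("assistant"):
        return RESPONSE
    return IGNORE


def handle_input(
    main: ToyFixedStateModel,
    listener: Optional[ToyFixedStateModel],
    chunk: str,
) -> Tuple[ToyFixedStateModel, Optional[ToyFixedStateModel], str]:
    """Route one chunk of user input; returns (main, listener, token)."""
    branch = listener if listener is not None else main.fork()
    branch.consume(chunk)
    token = toy_state_token(branch)
    if token == INCOMPLETE:
        return main, branch, token   # keep listening; main keeps generating
    if token == RESPONSE:
        return branch, None, token   # branch takes over: an interruption
    return main, None, token         # discard the branch; main is untouched


if __name__ == "__main__":
    main = ToyFixedStateModel()
    main, aux, tok = handle_input(main, None, "assistant, what is...")
    print(tok)
    main, aux, tok = handle_input(main, aux, "assistant, what is Mamba?")
    print(tok)
    main, aux, tok = handle_input(main, aux, "background chatter")
    print(tok)
```

The point of the sketch is the design choice named in the caption: because the state has a fixed size, forking an auxiliary branch costs a single copy, and discarding it leaves the main branch exactly as it was.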