DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities

Xiangyu Lu; Wang Xu; Haoyu Wang; Hongyun Zhou; Haiyan Zhao; Conghui Zhu; Tiejun Zhao; Muyun Yang

TL;DR

DuplexMamba is a real-time, end-to-end multimodal duplex system for speech-to-text conversation built on the Mamba architecture. It combines a ConMamba speech encoder, a speech adapter, and a Mamba-based language model. Because the model keeps a fixed-size state, its memory use during inference does not grow with input length, which makes duplex and streaming operation practical. Training proceeds in four stages (multimodal alignment, multimodal instruction tuning, input state discrimination, and streaming alignment) that provide cross-modal alignment and robust real-time operation, while a duplex decoding strategy with state tokens lets the model handle new inputs and interruptions dynamically. On ASR and VoiceBench, DuplexMamba performs competitively with Transformer-based baselines while using memory more efficiently and handling interruptions effectively, which illustrates the practical value of fixed-state models for streaming speech interaction.

Abstract

Real-time speech conversation is essential for natural and efficient human-machine interaction, and it requires duplex and streaming capabilities. Traditional Transformer-based conversational chatbots operate in a turn-based manner, and their computational cost grows quadratically with input length. In this paper, we propose DuplexMamba, a Mamba-based end-to-end multimodal duplex model for speech-to-text conversation. DuplexMamba processes input and generates output simultaneously, adjusting dynamically to support real-time streaming. Specifically, we develop a Mamba-based speech encoder and adapt it to a Mamba-based language model. Furthermore, we introduce a novel duplex decoding strategy that enables this simultaneous input processing and output generation. Experimental results demonstrate that DuplexMamba successfully implements duplex and streaming capabilities while achieving performance comparable to several recently developed Transformer-based models on automatic speech recognition (ASR) tasks and voice assistant benchmark evaluations. Our code and model are released.
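
The contrast between a fixed-size state and a Transformer's growing context can be made concrete with a toy example. The sketch below is not the DuplexMamba implementation: it is a minimal diagonal linear state-space recurrence with constant parameters (Mamba itself makes these parameters input-dependent), and the names `ssm_step`, `stream`, and `kv_cache_sizes` are hypothetical helpers introduced only for illustration. It shows that a streaming recurrent model carries the same number of state values after every input, whereas a key-value cache holds one entry per token seen so far.

```python
from typing import List, Tuple


def ssm_step(state: List[float], x: float,
             a: List[float], b: List[float], c: List[float]
             ) -> Tuple[List[float], float]:
    """One step of a toy diagonal linear state-space model (illustrative only).

    h_t = a * h_{t-1} + b * x_t   (elementwise)
    y_t = sum(c * h_t)
    """
    new_state = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, state, b)]
    y = sum(c_i * h_i for c_i, h_i in zip(c, new_state))
    return new_state, y


def stream(inputs: List[float],
           a: List[float], b: List[float], c: List[float]
           ) -> Tuple[List[float], List[float]]:
    """Consume inputs one at a time; only the fixed-size state is carried over."""
    state = [0.0] * len(a)
    outputs = []
    for x in inputs:
        state, y = ssm_step(state, x, a, b, c)
        outputs.append(y)
    return outputs, state


def kv_cache_sizes(num_steps: int) -> List[int]:
    """Entries held by a key-value cache after each step: one per token so far."""
    return list(range(1, num_steps + 1))


if __name__ == "__main__":
    # Assumed toy parameters, chosen only so the arithmetic is easy to follow.
    a, b, c = [0.5, 0.0], [1.0, 1.0], [1.0, 1.0]
    outputs, final_state = stream([1.0, 1.0, 1.0], a, b, c)
    print("outputs:", outputs)
    print("state size after 3 inputs:", len(final_state))
    print("KV-cache entries after each of 3 inputs:", kv_cache_sizes(3))
```

The state here always has two entries regardless of how many inputs have been consumed, which is the property that makes a recurrent model attractive for long-running streaming conversations.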
