FlexDuo: A Pluggable System for Enabling Full-Duplex Capabilities in Speech Dialogue Systems
Borui Liao, Yulong Xu, Jiao Ou, Kaiyuan Yang, Weihua Jian, Pengfei Wan, Di Zhang
TL;DR
FlexDuo addresses the tight coupling and noise interference of current full-duplex speech dialogue systems (SDS) by introducing a pluggable full-duplex control module that can interface with existing half-duplex LLMs. Its components are a Context Manager, a State Manager built on a seven-state FSM, and a Sliding Window, and it adds an explicit Idle state for filtering non-dialogue audio. At each step, the next action is computed as $\pi_t = F(C, S_{t-1}, W_t; \theta)$ and guides dialogue behavior. On Fisher data, FlexDuo reduces false interruptions by about 23–25% and improves turn-taking by about 7–8% relative to baselines, while achieving lower conditional perplexity and lower latency than VAD-based systems. The approach demonstrates a practical path to modular, reusable full-duplex dialogue systems and opens avenues for richer multimodal dialogue control, potentially enhanced with reinforcement learning.
Abstract
Full-Duplex Speech Dialogue Systems (Full-Duplex SDS) have significantly enhanced the naturalness of human-machine interaction by enabling real-time bidirectional communication. However, existing approaches face challenges such as difficulties in independent module optimization and contextual noise interference due to highly coupled architectural designs and oversimplified binary state modeling. This paper proposes FlexDuo, a flexible full-duplex control module that decouples duplex control from spoken dialogue systems through a plug-and-play architectural design. Furthermore, inspired by human information-filtering mechanisms in conversations, we introduce an explicit Idle state. On one hand, the Idle state filters redundant noise and irrelevant audio to enhance dialogue quality. On the other hand, it establishes a semantic integrity-based buffering mechanism, reducing the risk of mutual interruptions while ensuring accurate response transitions. Experimental results on the Fisher corpus demonstrate that FlexDuo reduces the false interruption rate by 24.9% and improves response accuracy by 7.6% compared to integrated full-duplex dialogue system baselines. It also outperforms voice activity detection (VAD) controlled baseline systems in both Chinese and English dialogue quality. The proposed modular architecture and state-based dialogue model provide a novel technical pathway for building flexible and efficient duplex dialogue systems.
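To make the plug-and-play idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simplified three-state machine (Idle, Listen, Speak) rather than the full state set of the paper, treats text chunks as a stand-in for streaming audio, and replaces the learned policy with a hand-written rule. The names `DuplexController`, `toy_policy`, and `respond` are hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Deque, List


class State(Enum):
    IDLE = "idle"      # incoming audio is ignored (noise, irrelevant speech)
    LISTEN = "listen"  # user speech is accumulated into the dialogue context
    SPEAK = "speak"    # the half-duplex model's reply is being delivered


@dataclass
class DuplexController:
    """Hypothetical control module wrapped around any half-duplex responder."""
    # Stand-in for a policy of the form F(C, S_{t-1}, W_t; theta).
    policy: Callable[[List[str], State, List[str]], State]
    # Stand-in for an existing half-duplex model: context in, reply out.
    respond: Callable[[List[str]], str]
    window_size: int = 3
    state: State = State.IDLE
    context: List[str] = field(default_factory=list)
    window: Deque[str] = field(default_factory=deque)
    replies: List[str] = field(default_factory=list)

    def step(self, chunk: str) -> State:
        # Keep only the most recent `window_size` chunks.
        self.window.append(chunk)
        while len(self.window) > self.window_size:
            self.window.popleft()
        prev = self.state
        nxt = self.policy(self.context, prev, list(self.window))
        if nxt is State.LISTEN:
            self.context.append(chunk)
        elif nxt is State.SPEAK and prev is not State.SPEAK:
            # Turn-taking: keep the final chunk, then query the model once.
            self.context.append(chunk)
            self.replies.append(self.respond(self.context))
        # Otherwise (Idle, or already speaking) the chunk is filtered out.
        self.state = nxt
        return nxt


def toy_policy(context: List[str], prev: State, window: List[str]) -> State:
    """Hand-written rule used only for illustration (assumption)."""
    latest = window[-1]
    if latest.startswith("[noise]"):
        # Noise never interrupts an ongoing reply and never enters the context.
        return State.SPEAK if prev is State.SPEAK else State.IDLE
    if latest.endswith(("?", ".")):
        return State.SPEAK   # utterance looks complete: hand over to the model
    return State.LISTEN      # incomplete user speech (also acts as a barge-in)


ctrl = DuplexController(policy=toy_policy,
                        respond=lambda ctx: "reply to: " + " ".join(ctx))
chunks = ["[noise] tv", "what is", "the weather?", "[noise] cough", "wait"]
states = [ctrl.step(c) for c in chunks]
```

In this toy run the two noise chunks never reach the context and the cough does not interrupt the reply, while the unfinished "wait" moves the controller back to listening. The responder is only ever called through `respond`, which is the sense in which the control logic stays decoupled from the dialogue model.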
