Enabling Real-Time Conversations with Minimal Training Costs
Wang Xu, Shuo Wang, Weilin Zhao, Xu Han, Yukun Yan, Yudi Zhang, Zhe Tao, Zhiyuan Liu, Wanxiang Che
TL;DR
The paper targets real-time, fluid conversational AI, which traditional turn-based LLM chat systems cannot deliver because they cannot listen and generate at the same time. It introduces DUO, a duplex decoding approach based on channel-division multiplexing that processes incoming input in parallel with autoregressive output generation and requires only minimal additional training. A small 10K-sample dataset with state-token signals is used to demonstrate that the model can handle both awakening and interrupt interactions, with a focus on preserving the backbone model’s capabilities. Empirical results show that DUO improves responsiveness and human-likeness at substantially lower training cost than prior duplex methods, enabling more natural, interruptible conversations in real-time applications. By reducing computational overhead, the approach could broaden real-time AI deployment across dialogue, interruption, and non-query contexts.
Abstract
Large language models (LLMs) have demonstrated the ability to improve human efficiency through conversational interactions. Conventional LLM-powered dialogue systems operate on a turn-based paradigm, which precludes real-time interaction while a response is being generated. To address this limitation, researchers have proposed duplex models, which dynamically adapt to user input and thereby support real-time interactive feedback. However, these methods typically require substantial computational resources to acquire this duplex ability. To reduce the overhead, this paper presents a new duplex decoding approach that equips LLMs with duplex ability while requiring minimal additional training. Specifically, our method decodes queries and responses in parallel, effectively implementing a channel-division-multiplexing decoding strategy. Experimental results indicate that the proposed method significantly enhances the naturalness and human-likeness of user-AI interactions with minimal training costs.
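To make the idea of decoding queries and responses in parallel more concrete, the following is a toy sketch only, not the authors' implementation. It assumes two token channels advanced in lockstep: at every step at most one incoming user token is appended to the input channel and exactly one token is emitted on the output channel. The names `duplex_decode`, `toy_policy`, `IDLE`, and `REPLY` are hypothetical, and the rule-based policy stands in for an LLM that would condition on both channels.

```python
from typing import Callable, List, Optional, Tuple

IDLE = "<idle>"  # hypothetical state token: the model chooses to stay silent
REPLY = ["sure", ",", "here", "it", "is"]  # canned reply used by the toy policy

Policy = Callable[[List[str], List[str]], str]


def duplex_decode(
    user_stream: List[Optional[str]],
    step_fn: Policy,
    max_steps: int,
) -> Tuple[List[str], List[str]]:
    """Advance an input channel and an output channel one token per step."""
    input_channel: List[str] = []
    output_channel: List[str] = []
    for t in range(max_steps):
        # Input channel: take the user token that arrives at step t, if any.
        incoming = user_stream[t] if t < len(user_stream) else None
        if incoming is not None:
            input_channel.append(incoming)
        # Output channel: emit one token conditioned on both channels so far.
        output_channel.append(step_fn(input_channel, output_channel))
    return input_channel, output_channel


def toy_policy(inp: List[str], out: List[str]) -> str:
    """Stand-in for the model: stay idle until a question mark has been heard."""
    spoken = [tok for tok in out if tok != IDLE]
    query_seen = any(tok.endswith("?") for tok in inp)
    if query_seen and len(spoken) < len(REPLY):
        return REPLY[len(spoken)]
    return IDLE


stream = ["what", "time", "is", "it?", None, None, None, None]
heard, said = duplex_decode(stream, toy_policy, max_steps=8)
print(said)
```

In this sketch the output channel carries the idle token while the user is still speaking and switches to reply tokens once the query is complete. A real system would replace `toy_policy` with a model forward pass and would also need a rule for interruptions, which this example does not cover.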
