Enabling Real-Time Conversations with Minimal Training Costs

Wang Xu; Shuo Wang; Weilin Zhao; Xu Han; Yukun Yan; Yudi Zhang; Zhe Tao; Zhiyuan Liu; Wanxiang Che

TL;DR

The paper tackles the challenge of real-time, fluid conversational AI by addressing the limitations of traditional turn-based LLM chat systems, which hinder simultaneous listening and generation. It introduces DUO, a duplex decoding approach based on channel-division multiplexing that allows parallel input preprocessing and autoregressive output while requiring only minimal additional training. A small 10K-sample dataset with state-token signals demonstrates the model’s ability to handle both awakening and interrupt interactions, with a focus on maintaining the backbone model’s capabilities. Empirical results show that DUO improves responsiveness and human-likeness with substantially lower training costs than prior duplex methods, enabling more natural, interruptible conversations in real-time applications. This approach potentially broadens real-time AI deployment by reducing computational overhead and facilitating seamless user interactions across dialogue, interruption, and non-query contexts.

Abstract

Large language models (LLMs) have demonstrated the ability to improve human efficiency through conversational interactions. Conventional LLM-powered dialogue systems, operating on a turn-based paradigm, preclude real-time interaction during response generation. To address this limitation, researchers have proposed duplex models, which can dynamically adapt to user input and so provide real-time interactive feedback. However, these methods typically require substantial computational resources to acquire this duplex ability. To reduce that overhead, this paper presents a new duplex decoding approach that gives LLMs duplex ability with minimal additional training. Specifically, our method decodes queries and responses in parallel, effectively implementing a channel-division-multiplexing decoding strategy. Experimental results indicate that our proposed method significantly enhances the naturalness and human-likeness of user-AI interactions with minimal training costs.

Paper Structure (13 sections, 6 figures, 1 table)

Introduction
Methodology
Parallel Decoding
Channel Transition
Dataset Construction
Experiments
Setup
Main Results
Conclusion
Appendix
Data Construction
Training Data Example
Related Work

Figures (6)

Figure 1: Top Left: A new decoding branch is established when a user interrupts the model's generation. DUO does not increase the number of forward passes compared to standard decoding. Right: The tokens generated by the input and output channels after time step $t_1$ do not attend to each other, despite sharing the same prefix tokens. Bottom Left: Channel transition is activated when the state tokens are predicted.
Figure 2: The comparison result between MiniCPM-Duo and MiniCPM-Duplex on responsiveness, human-likeness, factuality, faithfulness, and overall satisfaction.
Figure 3: Case study. The black text denotes the predicted text in the input channel.
Figure 4: The prompt used for data construction.
Figure 5: The training data example of MiniCPM-Duplex.
...and 1 more figure
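
The attention pattern described in the Figure 1 caption (both channels see the shared prefix, but tokens produced in the input and output channels after $t_1$ do not attend to each other) can be illustrated with a small mask builder. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `duplex_attention_mask`, the flattened layout `[prefix | input channel | output channel]`, and the boolean-matrix representation are all hypothetical choices made for illustration.

```python
def duplex_attention_mask(prefix_len, input_len, output_len):
    """Build a boolean mask where mask[i][j] is True if position i may attend to j.

    Assumed layout (illustrative only): shared prefix tokens first, then the
    input-channel tokens, then the output-channel tokens.
    """
    in_start = prefix_len
    out_start = prefix_len + input_len
    total = out_start + output_len

    def channel(pos):
        # Map a flat position to the channel it belongs to.
        if pos < in_start:
            return "prefix"
        if pos < out_start:
            return "input"
        return "output"

    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            # Causal within the flat order, and only the shared prefix or the
            # token's own channel is visible; the other channel is masked out.
            same_or_prefix = channel(j) in ("prefix", channel(i))
            row.append(j <= i and same_or_prefix)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Two prefix tokens, two input-channel tokens, two output-channel tokens.
    for row in duplex_attention_mask(2, 2, 2):
        print("".join("1" if allowed else "0" for allowed in row))
```

In this sketch each channel keeps ordinary causal attention over the shared prefix and its own tokens, which is one way to realize two independent decoding branches on top of a common context.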
