Enabling Real-Time Conversations with Minimal Training Costs

Wang Xu, Shuo Wang, Weilin Zhao, Xu Han, Yukun Yan, Yudi Zhang, Zhe Tao, Zhiyuan Liu, Wanxiang Che

TL;DR

The paper tackles the challenge of real-time, fluid conversational AI by addressing the limitations of traditional turn-based LLM chat systems, which hinder simultaneous listening and generation. It introduces DUO, a duplex decoding approach based on channel-division multiplexing that allows parallel input preprocessing and autoregressive output while requiring only minimal additional training. A small 10K-sample dataset with state-token signals demonstrates the model’s ability to handle both awakening and interrupt interactions, with a focus on maintaining the backbone model’s capabilities. Empirical results show that DUO improves responsiveness and human-likeness with substantially lower training costs than prior duplex methods, enabling more natural, interruptible conversations in real-time applications. This approach potentially broadens real-time AI deployment by reducing computational overhead and facilitating seamless user interactions across dialogue, interruption, and non-query contexts.

Abstract

Large language models (LLMs) have demonstrated the ability to improve human efficiency through conversational interactions. Conventional LLM-powered dialogue systems, operating on a turn-based paradigm, preclude real-time interaction during response generation. To address this limitation, researchers have proposed duplex models. These models can dynamically adapt to user input, facilitating real-time interactive feedback. However, these methods typically require substantial computational resources to acquire the ability. To reduce overhead, this paper presents a new duplex decoding approach that enhances LLMs with duplex ability, requiring minimal additional training. Specifically, our method employs parallel decoding of queries and responses in conversations, effectively implementing a channel-division-multiplexing decoding strategy. Experimental results indicate that our proposed method significantly enhances the naturalness and human-likeness of user-AI interactions with minimal training costs.

Paper Structure (13 sections, 6 figures, 1 table)

Figures (6)

  • Figure 1: Top Left: A new decoding branch is established when a user interrupts the model's generation. DUO does not increase the number of forward passes compared to standard decoding. Right: The tokens generated by the input and output channels after time step $t_1$ do not attend to each other, despite sharing the same prefix tokens. Bottom Left: Channel transition is activated when the state tokens are predicted.
  • Figure 2: Comparison of MiniCPM-Duo and MiniCPM-Duplex on responsiveness, human-likeness, factuality, faithfulness, and overall satisfaction.
  • Figure 3: Case study. The black text denotes the predicted text in the input channel.
  • Figure 4: The prompt used for data construction.
  • Figure 5: The training data example of MiniCPM-Duplex.
  • ...and 1 more figure
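
The attention pattern described in the Figure 1 caption (input-channel and output-channel tokens share the same prefix but do not attend to each other) can be illustrated with a small mask builder. This is only a sketch under stated assumptions, not the paper's implementation: the function name `duplex_attention_mask`, the flattened layout `[prefix | input channel | output channel]`, and the use of a boolean matrix are all hypothetical choices made for illustration.

```python
def duplex_attention_mask(prefix_len, n_input, n_output):
    """Build a boolean attention mask for two channels sharing one prefix.

    Assumed (hypothetical) layout of positions in one flattened sequence:
        [ prefix tokens | input-channel tokens | output-channel tokens ]

    mask[i][j] is True when query position i may attend to key position j.
    """
    in_start = prefix_len
    out_start = prefix_len + n_input
    total = out_start + n_output

    def channel(pos):
        # Map a flattened position to the segment it belongs to.
        if pos < in_start:
            return "prefix"
        if pos < out_start:
            return "input"
        return "output"

    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            same_or_prefix = channel(j) == "prefix" or channel(i) == channel(j)
            # Causal (j <= i) and restricted to the shared prefix
            # or to the query token's own channel.
            mask[i][j] = j <= i and same_or_prefix
    return mask


if __name__ == "__main__":
    # 2 prefix tokens, 2 input-channel tokens, 2 output-channel tokens.
    for row in duplex_attention_mask(2, 2, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, each row is a query position and each `x` marks a key position it may attend to. Rows for output-channel tokens have `x` under the prefix and under earlier output tokens, but not under the input-channel columns, which mirrors the "do not attend to each other, despite sharing the same prefix tokens" behavior in the caption.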