LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems

Hao Zhang, Weiwei Li, Rilin Chen, Vinay Kothapally, Meng Yu, Dong Yu

TL;DR

The paper tackles real-time full-duplex spoken dialogue systems by introducing a semantic voice activity detection (VAD) module that acts as the dialogue manager (DM). A lightweight fine-tuned LLM outputs four control tokens that govern turn-taking and distinguish intentional from unintentional barge-ins, while the core dialogue engine (CDE) stays dormant unless a response must be generated. This decouples DM optimization from the CDE, reducing computational load while preserving interaction quality. Experiments on Chinese full-duplex data report high token-prediction accuracy and robustness on real recordings, supporting scalable and efficient deployment.

Abstract

Achieving full-duplex communication in spoken dialogue systems (SDS) requires real-time coordination between listening, speaking, and thinking. This paper proposes a semantic voice activity detection (VAD) module as a dialogue manager (DM) to efficiently manage turn-taking in full-duplex SDS. Implemented as a lightweight (0.5B) LLM fine-tuned on full-duplex conversation data, the semantic VAD predicts four control tokens to regulate turn-switching and turn-keeping, distinguishing between intentional and unintentional barge-ins while detecting query completion for handling user pauses and hesitations. By processing input speech in short intervals, the semantic VAD enables real-time decision-making, while the core dialogue engine (CDE) is only activated for response generation, reducing computational overhead. This design allows independent DM optimization without retraining the CDE, balancing interaction accuracy and inference efficiency for scalable, next-generation full-duplex SDS.
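
The control flow described in the abstract can be illustrated with a small state machine. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract says only that there are four control tokens covering turn-keeping and turn-switching, so the token names, the `AgentState` enum, and the `step` and `run_dialogue` helpers below are hypothetical. The sketch also assumes that the CDE is invoked only when the turn switches to the agent, which is one reading of "only activated for response generation".

```python
from enum import Enum
from typing import Iterable, Tuple


class ControlToken(Enum):
    # Hypothetical names: the source states only that four tokens exist.
    CONTINUE_LISTENING = "continue_listening"  # query incomplete (pause or hesitation)
    START_SPEAKING = "start_speaking"          # query complete: turn switches to agent
    CONTINUE_SPEAKING = "continue_speaking"    # unintentional barge-in: agent keeps turn
    START_LISTENING = "start_listening"        # intentional barge-in: agent yields turn


class AgentState(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


def step(state: AgentState, token: ControlToken) -> Tuple[AgentState, bool]:
    """Apply one DM decision; return (next_state, activate_cde)."""
    if state is AgentState.LISTENING:
        if token is ControlToken.CONTINUE_LISTENING:
            return AgentState.LISTENING, False
        if token is ControlToken.START_SPEAKING:
            # Assumed: the only point at which the CDE is woken up.
            return AgentState.SPEAKING, True
    else:
        if token is ControlToken.CONTINUE_SPEAKING:
            return AgentState.SPEAKING, False
        if token is ControlToken.START_LISTENING:
            return AgentState.LISTENING, False
    raise ValueError(f"token {token.name} is not valid in state {state.name}")


def run_dialogue(tokens: Iterable[ControlToken]) -> Tuple[AgentState, int]:
    """Replay per-interval DM tokens; return the final state and CDE call count."""
    state = AgentState.LISTENING
    cde_calls = 0
    for token in tokens:
        state, activate_cde = step(state, token)
        cde_calls += int(activate_cde)
    return state, cde_calls


if __name__ == "__main__":
    decisions = [
        ControlToken.CONTINUE_LISTENING,  # user hesitates mid-query
        ControlToken.START_SPEAKING,      # query complete, CDE generates a reply
        ControlToken.CONTINUE_SPEAKING,   # background noise or backchannel
        ControlToken.START_LISTENING,     # user deliberately interrupts
    ]
    print(run_dialogue(decisions))
```

In this sketch the lightweight DM runs on every short input interval, while the count returned by `run_dialogue` shows how rarely the heavier CDE would be called under the stated assumption.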

Paper Structure

This paper contains 14 sections, 1 figure, 3 tables, 1 algorithm.

Figures (1)

  • Figure 1: (a) The proposed system architecture; (b) Control tokens and their corresponding actions; (c) Interactions between the DM and CDE; (d) An example of a full-duplex conversation; and (e) Distribution of interaction scenarios in the generated full-duplex conversation dataset.