TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving

Wenqian Cui, Lei Zhu, Xiaohui Li, Zhihan Guo, Haoli Bai, Lu Hou, Irwin King

TL;DR

TurnGuide addresses the degraded conversational quality of end-to-end Full-Duplex Speech Language Models (FD-SLMs) by dynamically segmenting assistant speech into turns and interleaving text and speech at the turn level, so that LLM-like semantic guidance can be integrated into real-time double-channel speech. The method couples a segmentation and alignment pipeline with two interleaving strategies (channel-wise and turn-based text-speech interleaving), trained on the Fisher dataset with GLM-4-Voice, and reports substantial gains over strong baselines in semantic meaningfulness and turn-taking behavior. Key contributions are a practical turn alignment framework built on VAD/ASR timestamps, a text-guided dialogue modeling scheme that enforces turn boundaries, and an evaluation showing improved GPT-score semantics and Full-Duplex-Bench turn-taking metrics. The work also discusses limitations, including dataset specificity and reliance on automated evaluation metrics, and points to future extensions to broader datasets, multi-speaker scenarios, and safety-oriented refinements for deployment.

Abstract

Full-Duplex Speech Language Models (FD-SLMs) are specialized foundation models designed to enable natural, real-time spoken interactions by modeling complex conversational turn-taking such as interruptions, backchannels, and overlapping speech. End-to-end (e2e) FD-SLMs leverage real-world double-channel conversational data to capture nuanced two-speaker dialogue patterns for human-like interactions, but their conversational abilities often degrade compared to pure-text conversation due to prolonged speech sequences and limited high-quality spoken dialogue data. Although interleaved text-speech generation could mitigate this degradation, integrating discrete text tokens into continuous double-channel audio streams could disrupt the precise time alignment required for fluid interaction. To address this, we propose TurnGuide, a novel text-speech interleaved generation approach for e2e FD-SLMs that dynamically segments assistant speech into dialogue turns and interleaves turn-level text and speech generation. This approach allows FD-SLMs to integrate the semantic intelligence of LLMs without compromising the natural acoustic flow. Extensive experiments show that TurnGuide not only significantly improves e2e FD-SLMs to produce semantically meaningful, coherent speech but also achieves state-of-the-art performance on various turn-taking events. Demos are available at https://dreamtheater123.github.io/TurnGuide-Demo/. Code will be available at https://github.com/dreamtheater123/TurnGuide.
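To make the turn-level idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes word-level timestamps for the assistant channel are already available (for example, from an ASR system), groups words into turns whenever the silence between consecutive words exceeds a threshold, and then emits each turn's text immediately before a placeholder for that turn's speech span. The names `Word`, `Turn`, `segment_turns`, `interleave`, and the `max_gap` threshold of 0.5 seconds are all hypothetical choices made for illustration; the paper's actual segmentation rules and token layout may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Turn:
    text: str
    start: float
    end: float


def _close(words: List[Word]) -> Turn:
    """Collapse a run of words into a single turn."""
    return Turn(" ".join(w.text for w in words), words[0].start, words[-1].end)


def segment_turns(words: List[Word], max_gap: float = 0.5) -> List[Turn]:
    """Split time-ordered words into turns at silences longer than max_gap.

    max_gap is an illustrative assumption, not a value from the paper.
    """
    turns: List[Turn] = []
    current: List[Word] = []
    for w in words:
        if current and w.start - current[-1].end > max_gap:
            turns.append(_close(current))
            current = []
        current.append(w)
    if current:
        turns.append(_close(current))
    return turns


Segment = Tuple[str, Union[str, Tuple[float, float]]]


def interleave(turns: List[Turn]) -> List[Segment]:
    """Emit each turn's text first, then a placeholder for its speech span.

    A real model would place speech tokens where this sketch stores
    the (start, end) time span.
    """
    sequence: List[Segment] = []
    for t in turns:
        sequence.append(("text", t.text))
        sequence.append(("speech", (t.start, t.end)))
    return sequence


words = [
    Word("uh", 0.0, 0.2),
    Word("huh", 0.25, 0.5),
    Word("so", 2.0, 2.2),
    Word("anyway", 2.3, 2.8),
]
turns = segment_turns(words)
sequence = interleave(turns)
```

In this toy input, the 1.5-second silence after "huh" separates a short backchannel from the following utterance, so each becomes its own turn and its text precedes its speech placeholder in `sequence`.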

Paper Structure

This paper contains 30 sections, 7 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Illustration of the two types of FD-SLMs. Cascaded FD-SLMs utilize a state predictor to guide predefined behaviors, while e2e FD-SLMs capture the complex dynamics of real-world double-channel dialogue data.
  • Figure 2: Illustration of the TurnGuide approach. The first part illustrates the multi-modal turn segmentation and alignment framework, which dynamically segments assistant speech into turns and aligns the text with speech turns. The second part shows the text-guided full duplex dialogue modeling framework with two interleaving strategies.
  • Figure 3: Perplexity evaluation of FD-SLMs across temperatures using Fisher-text-only, GLM-4-Voice, and Llama-3.1-8B-Instruct as evaluation models.
  • Figure 4: Illustration of the Moshi training strategy.
  • Figure 5: Training and validation loss comparison of different training strategies.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Remark 1
  • Remark 2