TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving
Wenqian Cui, Lei Zhu, Xiaohui Li, Zhihan Guo, Haoli Bai, Lu Hou, Irwin King
TL;DR
TurnGuide addresses the degraded conversational quality of end-to-end Full-Duplex Speech Language Models (FD-SLMs) by dynamically segmenting assistant speech into dialogue turns and interleaving text and speech at the turn level, which brings LLM-like semantic guidance into real-time double-channel speech. The method couples a dynamic segmentation and alignment pipeline with two interleaving strategies (channel-wise and turn-based text-speech interleaving), and is trained on the Fisher dataset with GLM-4-Voice. Key contributions are a practical turn alignment framework using VAD/ASR timestamps and a text-guided dialogue modeling scheme that enforces turn boundaries. Evaluation against strong baselines shows substantial gains in semantic meaningfulness (GPT-score) and in turn-taking behavior (Full-Duplex-Bench metrics). The work also discusses limitations such as dataset specificity and reliance on automated evaluation metrics, and points to future extensions to broader datasets, multi-speaker scenarios, and safety-oriented refinements for deployment.
Abstract
Full-Duplex Speech Language Models (FD-SLMs) are specialized foundation models designed to enable natural, real-time spoken interactions by modeling complex conversational turn-taking such as interruptions, backchannels, and overlapping speech. End-to-end (e2e) FD-SLMs leverage real-world double-channel conversational data to capture nuanced two-speaker dialogue patterns for human-like interactions, but their conversational abilities often degrade compared to pure-text conversation due to prolonged speech sequences and limited high-quality spoken dialogue data. Although interleaved text-speech generation could mitigate this degradation, integrating discrete text tokens into continuous double-channel audio streams can disrupt the precise time alignment required for fluid interaction. To address this, we propose TurnGuide, a novel text-speech interleaved generation approach for e2e FD-SLMs that dynamically segments assistant speech into dialogue turns and interleaves turn-level text and speech generation. This approach allows FD-SLMs to integrate the semantic intelligence of LLMs without compromising the natural acoustic flow. Extensive experiments show that TurnGuide significantly improves the ability of e2e FD-SLMs to produce semantically meaningful, coherent speech and also achieves state-of-the-art performance on various turn-taking events. Demos are available at https://dreamtheater123.github.io/TurnGuide-Demo/. Code will be available at https://github.com/dreamtheater123/TurnGuide.
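To make the idea of turn-level interleaving concrete, the following minimal Python sketch groups time-stamped words into turns and lays out each turn's text before its speech frames. It is an illustration under stated assumptions, not the authors' pipeline: the silence-gap rule, the 500 ms threshold, the 80 ms frame duration, and the names `Word`, `Turn`, `segment_turns`, and `interleave_turns` are all hypothetical, and real speech tokens are replaced here by frame counts.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Segment = Tuple[str, Union[str, int]]  # ("text", str) or ("speech"/"silence", n_frames)


@dataclass
class Word:
    text: str
    start_ms: int  # word onset, e.g. from ASR timestamps (assumed input format)
    end_ms: int    # word offset


@dataclass
class Turn:
    text: str
    start_ms: int
    end_ms: int


def segment_turns(words: List[Word], max_gap_ms: int = 500) -> List[Turn]:
    """Start a new turn whenever the silence between two words exceeds max_gap_ms.

    The gap rule and threshold are illustrative assumptions.
    """
    turns: List[Turn] = []
    current: List[Word] = []
    for word in words:
        if current and word.start_ms - current[-1].end_ms > max_gap_ms:
            turns.append(Turn(" ".join(w.text for w in current),
                              current[0].start_ms, current[-1].end_ms))
            current = []
        current.append(word)
    if current:
        turns.append(Turn(" ".join(w.text for w in current),
                          current[0].start_ms, current[-1].end_ms))
    return turns


def interleave_turns(turns: List[Turn], total_frames: int,
                     frame_ms: int = 80) -> List[Segment]:
    """Lay out one channel as silence frames, then turn text, then that turn's speech frames."""
    sequence: List[Segment] = []
    cursor = 0  # index of the next frame not yet covered
    for turn in turns:
        first = min(max(turn.start_ms // frame_ms, cursor), total_frames)
        last = min(max(-(-turn.end_ms // frame_ms), first), total_frames)  # ceiling division
        if first > cursor:
            sequence.append(("silence", first - cursor))
        sequence.append(("text", turn.text))        # text guidance precedes the turn's speech
        sequence.append(("speech", last - first))   # placeholder for the turn's speech tokens
        cursor = last
    if cursor < total_frames:
        sequence.append(("silence", total_frames - cursor))
    return sequence
```

In this sketch the text of a turn is emitted only at the turn's first frame, so the frame count of the channel is unchanged and the time alignment between the two channels is preserved; how TurnGuide itself places text tokens relative to speech tokens is described in the paper's method section.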
