SLIDE: Integrating Speech Language Model with LLM for Spontaneous Spoken Dialogue Generation
Haitian Lu, Gaofeng Cheng, Liuping Luo, Leying Zhang, Yanmin Qian, Pengyuan Zhang
TL;DR
SLIDE addresses the gap between semantic coherence and naturalness in spoken dialogue generation by using an LLM to generate the textual dialogue and a speech language model (SLM) to vocalize it with phoneme-level duration control. It introduces a two-tower transformer that predicts a duration for each phoneme of the written dialogue and conditions a dGSLM on the resulting spoken phoneme sequences, thereby preserving turn-taking and paralinguistic cues such as laughter and backchannels. Evaluations on the Fisher dataset show significant improvements in semantic coherence, with perplexity near that of the ground truth, while maintaining naturalistic turn-taking. The work demonstrates a practical hybrid approach that combines textual semantics with speech-unit realism for spontaneous dialogue.
Abstract
Recently, "textless" speech language models (SLMs) based on speech units have made huge progress in generating naturalistic speech, including non-verbal vocalizations. However, the generated speech samples often lack semantic coherence. In this paper, we propose SLM and LLM Integration for spontaneous spoken Dialogue gEneration (SLIDE). Specifically, we first utilize an LLM to generate the textual content of spoken dialogue. Next, we convert the textual dialogues into phoneme sequences and use a two-tower transformer-based duration predictor to predict the duration of each phoneme. Finally, an SLM conditioned on the spoken phoneme sequences is used to vocalize the textual dialogue. Experimental results on the Fisher dataset demonstrate that our system can generate naturalistic spoken dialogue while maintaining high semantic coherence.
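
One way to picture the step from a written phoneme sequence to a time-aligned "spoken" phoneme sequence is to repeat each phoneme for its predicted number of frames. The sketch below is illustrative only and is not taken from the paper: the function name `expand_phonemes`, the representation of durations as integer frame counts, and the ARPAbet-style symbols are all assumptions made for the example.

```python
from typing import List, Sequence


def expand_phonemes(phonemes: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each phoneme by its duration to get a frame-level sequence.

    Hypothetical helper: `durations` is assumed to hold one non-negative
    integer frame count per phoneme, e.g. the rounded output of a
    duration predictor.
    """
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")

    frames: List[str] = []
    for phoneme, duration in zip(phonemes, durations):
        # A duration of 0 drops the phoneme from the frame-level sequence.
        frames.extend([phoneme] * duration)
    return frames


# Example with made-up durations for the phonemes of "hi".
spoken = expand_phonemes(["HH", "AY"], [2, 3])
```

In this toy setting, the frame-level sequence stands in for the conditioning input to the SLM; how SLIDE actually represents and aligns spoken phonemes with speech units may differ.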
