SLIDE: Integrating Speech Language Model with LLM for Spontaneous Spoken Dialogue Generation

Haitian Lu, Gaofeng Cheng, Liuping Luo, Leying Zhang, Yanmin Qian, Pengyuan Zhang

TL;DR

SLIDE addresses the gap between semantic coherence and naturalness in spoken dialogue generation: an LLM generates the textual dialogue, and a speech language model (SLM) vocalizes it under phoneme-level duration control. A two-tower transformer predicts the duration of each phoneme in the written phoneme sequence, and a dGSLM is conditioned on the resulting spoken phoneme sequence, which preserves turn-taking and paralinguistic cues such as laughter and backchannels. Evaluations on the Fisher dataset show improved semantic coherence, with perplexity near that of the ground truth, while turn-taking remains naturalistic. The work demonstrates a practical hybrid approach that combines textual semantics with speech-unit realism for spontaneous dialogue.

Abstract

Recently, "textless" speech language models (SLMs) based on speech units have made huge progress in generating naturalistic speech, including non-verbal vocalizations. However, the generated speech samples often lack semantic coherence. In this paper, we propose SLM and LLM Integration for spontaneous spoken Dialogue gEneration (SLIDE). Specifically, we first utilize an LLM to generate the textual content of spoken dialogue. Next, we convert the textual dialogues into phoneme sequences and use a two-tower transformer-based duration predictor to predict the duration of each phoneme. Finally, an SLM conditioned on the spoken phoneme sequences is used to vocalize the textual dialogue. Experimental results on the Fisher dataset demonstrate that our system can generate naturalistic spoken dialogue while maintaining high semantic coherence.

Paper Structure

This paper contains 13 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The inference diagram of the proposed SLIDE model for spoken dialogue generation, with black representing elements from Channel A and brown representing elements from Channel B. Written phoneme sequence refers to phonemes obtained from grapheme-to-phoneme (G2P) conversion of textual dialogues, with silent phonemes inserted between different sentences. Spoken phoneme sequence extends the phonemes from the written phoneme sequence by repeating them to represent their durations in speech.
  • Figure 2: The temporal distribution of turn-taking events. The green triangles denote the mean values, and the solid lines within the boxes represent the median.
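
The relationship between the written and spoken phoneme sequences described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the silence symbol `SIL`, the ARPAbet-style phoneme labels, the helper names, and the idea of expressing each duration as an integer number of frames are all assumptions made for illustration. In SLIDE the durations would come from the two-tower duration predictor; here they are hard-coded.

```python
from typing import List, Sequence

SIL = "SIL"  # hypothetical symbol for the silent phoneme inserted between sentences


def build_written_sequence(sentences: Sequence[Sequence[str]], sil: str = SIL) -> List[str]:
    """Concatenate per-sentence phoneme lists, placing one silent phoneme between sentences."""
    written: List[str] = []
    for i, phonemes in enumerate(sentences):
        if i > 0:
            written.append(sil)
        written.extend(phonemes)
    return written


def expand_to_spoken(written: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each written phoneme by its duration (assumed here to be a frame count)."""
    if len(written) != len(durations):
        raise ValueError("need exactly one duration per written phoneme")
    spoken: List[str] = []
    for phoneme, frames in zip(written, durations):
        if frames < 0:
            raise ValueError("durations must be non-negative")
        spoken.extend([phoneme] * frames)
    return spoken


# Two toy sentences, already converted to phonemes by some G2P step.
written = build_written_sequence([["HH", "AY"], ["OW", "K"]])
# Hand-picked durations standing in for the duration predictor's output.
spoken = expand_to_spoken(written, [2, 3, 4, 1, 2])
print(written)
print(spoken)
```

The spoken sequence has one entry per frame, so its length equals the sum of the durations. This is what lets a unit-based model be conditioned on phonemes at the same time resolution as its speech units.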