PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems

Kentaro Mitsui; Koh Mitsuda; Toshiaki Wakatsuki; Yukiya Hono; Kei Sawada

TL;DR

This study extends the input and output sequences of the language model to support the parallel generation of text and speech, and shows that latency can be further reduced by generating speech in multiple sequences.

Abstract

Multimodal language models that process both text and speech have potential applications in spoken dialogue systems. However, current models face two major challenges in response generation latency: (1) generating a spoken response requires the prior generation of a written response, and (2) speech sequences are significantly longer than text sequences. This study addresses these issues by extending the input and output sequences of the language model to support the parallel generation of text and speech. Our experiments on spoken question answering tasks demonstrate that our approach improves latency while maintaining the quality of response content. Additionally, we show that latency can be further reduced by generating speech in multiple sequences. Demo samples are available at https://rinnakk.github.io/research/publications/PSLM.

Paper Structure (34 sections, 2 equations, 6 figures, 4 tables)

Introduction
PSLM
Speech Discretization
Speech Tokenization
Speech Detokenization
Integrating LMs with a Speech Stream
Introducing Multiple Speech Streams
Streaming Inference with HiFi-GAN
Overall Latency
Experimental Setup
Dataset
Configuration
Tokenization and Detokenization
Language Modeling
Baselines
...and 19 more sections

Figures (6)

Figure 1: (a) Chain-of-Modality prompting necessitates generating text questions (TQ) and text answers (TA) from speech questions (SQ) before producing speech answers (SA). (b) Our Parallel Speech Language Model (PSLM) enables the parallel decoding of TA and SA, reducing overall latency. (c) Introducing multiple speech streams further accelerates the generation of SA.
Figure 2: Architecture of PSLM.
Figure 3: Latency vs. TA length for different methods and tokens per second (TPS). PSLM-2x-ASR (50 TPS) is omitted because its latency is identical to PSLM-ASR (100 TPS).
Figure 4: Streaming inference using HiFi-GAN with receptive field size $R=5$ and SA length $N_\textrm{SA}=6$. Waveform generation begins once $N_\textrm{offset} = \lfloor R / 2 \rfloor + 1 = 3$ tokens are generated. Text tokens are omitted.
Figure 5: Prompt for ChatGPT evaluation.
...and 1 more figure
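
The offset formula in the Figure 4 caption can be checked with a short sketch. Only `streaming_offset` follows directly from the caption; `frames_ready` is a hypothetical helper that assumes a centered receptive field and that the remaining frames are flushed once all tokens have been generated. Neither function name comes from the paper.

```python
def streaming_offset(receptive_field: int) -> int:
    """Tokens required before waveform generation can begin.

    Implements N_offset = floor(R / 2) + 1 from the Figure 4 caption.
    """
    return receptive_field // 2 + 1


def frames_ready(n_generated: int, receptive_field: int, n_total: int) -> int:
    """Number of speech frames that can be synthesized so far.

    Assumption (not stated in the source): the receptive field is centered,
    so each frame needs floor(R / 2) tokens of lookahead, and the tail is
    flushed once all n_total tokens exist.
    """
    if n_generated >= n_total:
        return n_total
    return max(0, n_generated - receptive_field // 2)


if __name__ == "__main__":
    R, N_SA = 5, 6  # values used in the Figure 4 caption
    print("N_offset:", streaming_offset(R))
    print("frames ready per step:",
          [frames_ready(n, R, N_SA) for n in range(1, N_SA + 1)])
```

With $R=5$, `streaming_offset` returns 3, matching the caption. Under the stated assumption, no frame is ready until the third token has been generated.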
