Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model

Haibin Wu, Yuxuan Hu, Ruchao Fan, Xiaofei Wang, Kenichi Kumatani, Bo Ren, Jianwei Yu, Heng Lu, Lijuan Wang, Yao Qian, Jinyu Li

TL;DR

The paper tackles the challenge of efficient joint speech-text decoding within a single Speech LM by conducting a fair, controlled comparison of decoding paradigms and introducing an accelerated early-stop interleaved (ESI) approach. It demonstrates that interleaved decoding offers the best alignment between modalities, while ESI significantly reduces sequence length (about 25%) and preserves or slightly improves performance. Additionally, carefully curating high-quality speech QA datasets substantially boosts speech QA capabilities. The work provides practical guidance for deploying real-time speech dialogue systems and highlights the importance of data quality in multimodal speech-language modeling.

Abstract

Speech language models (Speech LMs) enable end-to-end speech-text modeling within a single model, offering a promising direction for spoken dialogue systems. The choice of joint speech-text decoding paradigm plays a critical role in performance, efficiency, and alignment quality. In this work, we systematically compare representative joint speech-text decoding strategies, including the interleaved and parallel generation paradigms, under a controlled experimental setup using the same base language model, speech tokenizer, and training data. Our results show that the interleaved approach achieves the best alignment. However, it suffers from slow inference due to its long token sequences. To address this, we propose a novel early-stop interleaved (ESI) pattern that not only significantly accelerates decoding but also yields slightly better performance. Additionally, we curate high-quality question answering (QA) datasets to further improve speech QA performance.

Paper Structure

This paper contains 20 sections, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Decoding patterns: (a) interleaved pattern; (b) parallel pattern; (c) Thinker-Talker pattern. (a) and (b) use the LM to decode both text and speech tokens, while (c) uses the LM to decode only text tokens. The figure is drawn with reference to chen2024slam.
  • Figure 2: The proposed early stop decoding paradigm to accelerate the interleaved pattern.
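
To make the sequence-length argument concrete, the toy sketch below builds an interleaved token sequence and an early-stop variant of it. It is an illustration under stated assumptions, not the paper's implementation: the chunk sizes (`n_text`, `n_speech`), the `<pad>` and `<stop>` tokens, and the rule that plain interleaving keeps filling text slots with padding after the text ends are all hypothetical choices made here for clarity.

```python
from typing import List

PAD = "<pad>"    # hypothetical filler for text slots once the text has ended
STOP = "<stop>"  # hypothetical marker that ends the interleaving early


def chunks(seq: List[str], size: int) -> List[List[str]]:
    """Split seq into consecutive chunks of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def interleave(text: List[str], speech: List[str],
               n_text: int = 1, n_speech: int = 3) -> List[str]:
    """Plain interleaving (assumed behaviour): alternate text and speech
    chunks, and pad every text slot once the text tokens run out."""
    text_chunks = chunks(text, n_text)
    speech_chunks = chunks(speech, n_speech)
    out: List[str] = []
    for i in range(max(len(text_chunks), len(speech_chunks))):
        t = text_chunks[i] if i < len(text_chunks) else []
        out.extend(t + [PAD] * (n_text - len(t)))  # fill the text slot
        if i < len(speech_chunks):
            out.extend(speech_chunks[i])
    return out


def early_stop_interleave(text: List[str], speech: List[str],
                          n_text: int = 1, n_speech: int = 3) -> List[str]:
    """Early-stop variant (assumed behaviour): interleave only while text
    remains, emit one STOP marker, then emit the remaining speech tokens."""
    text_chunks = chunks(text, n_text)
    speech_chunks = chunks(speech, n_speech)
    out: List[str] = []
    for i, t in enumerate(text_chunks):
        out.extend(t)
        if i < len(speech_chunks):
            out.extend(speech_chunks[i])
    out.append(STOP)
    for s in speech_chunks[len(text_chunks):]:
        out.extend(s)  # speech-only tail, no text padding
    return out


if __name__ == "__main__":
    text = ["t0", "t1"]
    speech = [f"s{i}" for i in range(12)]
    plain = interleave(text, speech)
    esi = early_stop_interleave(text, speech)
    print(len(plain), plain)
    print(len(esi), esi)
```

In this toy setup the saving comes entirely from dropping the padded text slots after the text ends, so it grows with the number of speech chunks that follow the last text token. The numbers it produces are not meant to reproduce the roughly 25% reduction reported above.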