Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model
Haibin Wu, Yuxuan Hu, Ruchao Fan, Xiaofei Wang, Kenichi Kumatani, Bo Ren, Jianwei Yu, Heng Lu, Lijuan Wang, Yao Qian, Jinyu Li
TL;DR
The paper tackles the challenge of efficient joint speech-text decoding within a single Speech LM by conducting a fair, controlled comparison of decoding paradigms and introducing an accelerated early-stop interleaved (ESI) approach. It demonstrates that interleaved decoding offers the best alignment between modalities, while ESI shortens the token sequence by about 25% and preserves or slightly improves performance. Additionally, carefully curated high-quality speech QA datasets further boost speech QA capabilities. The work provides practical guidance for deploying real-time speech dialogue systems and highlights the importance of data quality in multimodal speech-language modeling.
Abstract
Speech language models (Speech LMs) enable end-to-end speech-text modeling within a single model, offering a promising direction for spoken dialogue systems. The choice of joint speech-text decoding paradigm plays a critical role in performance, efficiency, and alignment quality. In this work, we systematically compare representative joint speech-text decoding strategies, including the interleaved and parallel generation paradigms, under a controlled experimental setup using the same base language model, speech tokenizer, and training data. Our results show that the interleaved approach achieves the best alignment. However, it suffers from slow inference due to its long token sequences. To address this, we propose a novel early-stop interleaved (ESI) pattern that not only significantly accelerates decoding but also yields slightly better performance. Additionally, we curate high-quality question answering (QA) datasets to further improve speech QA performance.
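To make the sequence-length argument concrete, the following toy sketch contrasts a plain interleaved layout with one possible early-stop variant. It is an illustration under stated assumptions, not the method as specified in the paper: the chunk sizes (`n_text`, `n_speech`), the `<pad>` and `<eot>` tokens, and the rule "pad text slots until speech ends" versus "emit one end-of-text marker, then speech only" are all hypothetical choices made for this example.

```python
from typing import List

PAD = "<pad>"  # hypothetical filler for text slots once the text is exhausted
EOT = "<eot>"  # hypothetical end-of-text marker used by the early-stop variant


def chunks(seq: List[str], size: int) -> List[List[str]]:
    """Split seq into consecutive chunks of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def interleave(text: List[str], speech: List[str],
               n_text: int = 2, n_speech: int = 4) -> List[str]:
    """Plain interleaving (assumed layout): every block holds n_text text
    slots followed by up to n_speech speech tokens. Text slots are padded
    once the text runs out, so padding accumulates while speech continues."""
    tc, sc = chunks(text, n_text), chunks(speech, n_speech)
    out: List[str] = []
    for i in range(max(len(tc), len(sc))):
        t = tc[i] if i < len(tc) else []
        out.extend(t + [PAD] * (n_text - len(t)))
        if i < len(sc):
            out.extend(sc[i])
    return out


def early_stop_interleave(text: List[str], speech: List[str],
                          n_text: int = 2, n_speech: int = 4) -> List[str]:
    """Early-stop variant (assumed layout): interleave only while text
    remains, emit a single EOT, then append the remaining speech tokens
    with no further text slots."""
    tc, sc = chunks(text, n_text), chunks(speech, n_speech)
    out: List[str] = []
    for i, t in enumerate(tc):
        out.extend(t)
        if i < len(sc):
            out.extend(sc[i])
    out.append(EOT)
    for s in sc[len(tc):]:
        out.extend(s)
    return out


# Toy input: 4 text tokens and 16 speech tokens (speech is typically longer).
text_tokens = [f"t{i}" for i in range(4)]
speech_tokens = [f"s{i}" for i in range(16)]

plain = interleave(text_tokens, speech_tokens)
early = early_stop_interleave(text_tokens, speech_tokens)
print(len(plain), len(early))
```

In this toy setting the early-stop layout is shorter because it drops the padded text slots after the text ends. The size of the saving depends entirely on the assumed chunk ratio and text-to-speech length ratio, so it should not be read as reproducing the reported reduction.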
