Speech-XL: Towards Long-Form Speech Understanding in Large Speech Language Models
Haoqin Sun, Chenyang Lyu, Shiwan Zhao, Xuanfan Ni, Xiangyu Kong, Longyue Wang, Weihua Luo, Yong Qin
TL;DR
Speech-XL targets the bottleneck of long-form speech understanding in large speech-language models by introducing Speech Summarization Tokens (SSTs) that compress local speech content into their KV states, substantially reducing memory and computation. The method partitions long audio into intervals and interleaves SSTs, with a compression ratio $\alpha_i$ controlling the density of SSTs; KV pruning is complemented by a curriculum that gradually increases the compression ratio. The model is trained in three stages (ASR alignment, Q&A, and SST compression) and achieves competitive performance on LongSpeech while preserving essential acoustic and semantic cues and improving efficiency. These results suggest that SST-based KV-space compression can extend end-to-end speech understanding to multi-minute inputs, offering scalable, data-efficient long-form processing and a path toward stronger generalization as foundation models evolve.
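The interval partitioning and SST interleaving described above can be illustrated with a small sketch. It is not the authors' implementation: the placeholder string `<SST>`, the function names, and the rule "one SST after every $\alpha_i$ speech tokens, plus one at the end of a partial group" are assumptions made for illustration only, as is the idea that only the SST positions keep their KV entries after pruning.

```python
from typing import List

SST = "<SST>"  # hypothetical placeholder for a Speech Summarization Token


def interleave_ssts(tokens: List[str], interval_len: int,
                    ratios: List[int]) -> List[List[str]]:
    """Split speech tokens into intervals and insert SSTs (illustrative only).

    Assumption: in interval i, one SST follows every ratios[i] speech tokens,
    and a trailing partial group also receives one SST.
    """
    intervals = [tokens[i:i + interval_len]
                 for i in range(0, len(tokens), interval_len)]
    if len(ratios) != len(intervals):
        raise ValueError("need exactly one compression ratio per interval")

    result = []
    for interval, alpha in zip(intervals, ratios):
        seq: List[str] = []
        for j, tok in enumerate(interval, start=1):
            seq.append(tok)
            # Close a group after alpha tokens or at the end of the interval.
            if j % alpha == 0 or j == len(interval):
                seq.append(SST)
        result.append(seq)
    return result


def retained_positions(seq: List[str]) -> List[int]:
    """Positions whose KV entries would survive pruning (assumed: SSTs only)."""
    return [i for i, tok in enumerate(seq) if tok == SST]


speech = [f"s{i}" for i in range(8)]
layout = interleave_ssts(speech, interval_len=4, ratios=[2, 4])
```

Under these assumptions, a larger ratio yields fewer SSTs per interval and therefore fewer retained KV entries: here the first interval (ratio 2) keeps two positions and the second (ratio 4) keeps one.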
Abstract
Despite the growing success of Large Speech Language Models (LSLMs) in processing short-term acoustic signals, their extension to long-form audio understanding is severely bottlenecked. This limitation stems from the limited context length and the exorbitant memory footprint required for long-form inference. In this work, we propose Speech-XL, a new model that capitalizes on the intrinsic key-value (KV) sparsification capacity of Large Language Models (LLMs) to achieve high-ratio speech input compression. Specifically, we introduce a novel special token, the Speech Summarization Token (SST), for each speech interval to encapsulate the intra-interval speech information into its associated KV pairs. The SST module is trained via instruction fine-tuning with a curriculum learning strategy in which the SST learns to compress information progressively, advancing from low-ratio (simple) to high-ratio (challenging) compression. Despite using significantly less training data than other baselines, our model achieves highly competitive performance on major benchmarks, including LongSpeech and AUDIOMARATHON. By addressing the long-standing bottlenecks in long-form audio modeling, our approach offers a novel perspective on the condensation of extensive acoustic sequences.
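The low-to-high compression curriculum can be pictured as a step-dependent schedule. The sketch below is only an illustration of that idea: the ratio values `(2, 4, 8, 16)`, the equal-length stages, and the function name `curriculum_ratio` are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def curriculum_ratio(step: int, total_steps: int,
                     ratios: Sequence[int] = (2, 4, 8, 16)) -> int:
    """Return the compression ratio to train with at a given step.

    Assumption: training is divided into equal-length stages, one per ratio,
    ordered from low (easy) to high (hard) compression.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")
    stage = step * len(ratios) // total_steps  # index of the current stage
    return ratios[stage]


# Ratio used at each of 8 training steps under the hypothetical schedule.
schedule = [curriculum_ratio(s, 8) for s in range(8)]
```

With these hypothetical values the schedule never decreases, so the SST is first trained on mild compression before being asked to summarize longer spans.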
