Speech-XL: Towards Long-Form Speech Understanding in Large Speech Language Models

Haoqin Sun, Chenyang Lyu, Shiwan Zhao, Xuanfan Ni, Xiangyu Kong, Longyue Wang, Weihua Luo, Yong Qin

TL;DR

Speech-XL tackles the bottleneck of long-form speech understanding in large speech-language models by introducing Speech Summarization Tokens (SSTs) that compress local speech content into KV states, dramatically reducing memory and computation. The method partitions long audio into intervals and interleaves SSTs, with a compression ratio $\alpha_i$ controlling the density of SSTs; KV pruning is complemented by a curriculum that gradually increases compression. The model employs a three-stage training regime (ASR alignment, Q&A, and SST compression) and demonstrates competitive performance on LongSpeech while preserving essential acoustic and semantic cues, with robust efficiency gains. These results indicate that SST-based KV-space compression can extend end-to-end speech understanding to multi-minute inputs, offering scalable, data-efficient long-form processing and a path toward stronger generalization as foundational models evolve.

Abstract

Despite the growing success of Large Speech Language Models (LSLMs) in processing short-term acoustic signals, their extension to long-form audio understanding is severely bottlenecked. This limitation stems from the limited context length and the exorbitant memory footprint required for long-form inference. In this work, we propose Speech-XL, a new model that capitalizes on the intrinsic key-value (KV) sparsification capacity of Large Language Models (LLMs) to achieve high-ratio speech input compression. Specifically, we introduce a novel special token, the Speech Summarization Token (SST), for each speech interval to encapsulate the intra-interval speech information into its associated KV pairs. The SST module is trained via instruction fine-tuning with a curriculum learning strategy in which the SST learns to compress information progressively, advancing from low-ratio (simple) to high-ratio (challenging) compression. Despite using significantly less training data than other baselines, our model achieves highly competitive performance on major benchmarks, including LongSpeech and AUDIOMARATHON. By addressing the long-standing bottlenecks in long-form audio modeling, our approach offers a novel perspective on the condensation of extensive acoustic sequences.
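
To make the interval-and-SST idea concrete, the following is a minimal Python sketch of the bookkeeping only, not the authors' implementation. It splits a sequence of speech-frame placeholders into fixed-length intervals, appends summarization tokens after each interval, and lists the positions whose KV pairs would be kept if only SST states were retained. The function names, the `<SST>` placeholder string, the rule of one SST per `ratio` frames, and the example ratios are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import List, Sequence

SST = "<SST>"  # hypothetical placeholder for a Speech Summarization Token


def interleave_ssts(frames: Sequence[str], interval_len: int, ratio: int) -> List[str]:
    """Split frames into intervals and append SSTs after each interval.

    Assumption (not from the paper): each interval of n frames receives
    ceil(n / ratio) SSTs, so a higher ratio means fewer SSTs per interval.
    """
    if interval_len <= 0 or ratio <= 0:
        raise ValueError("interval_len and ratio must be positive")
    out: List[str] = []
    for start in range(0, len(frames), interval_len):
        chunk = list(frames[start:start + interval_len])
        n_sst = -(-len(chunk) // ratio)  # ceiling division
        out.extend(chunk)
        out.extend([SST] * n_sst)
    return out


def retained_positions(sequence: Sequence[str]) -> List[int]:
    """Positions whose KV pairs survive if only SST states are kept."""
    return [i for i, token in enumerate(sequence) if token == SST]


def curriculum(ratios: Sequence[int]) -> List[int]:
    """Order compression ratios from low (simple) to high (challenging)."""
    return sorted(ratios)


frames = [f"f{i}" for i in range(10)]
sequence = interleave_ssts(frames, interval_len=4, ratio=2)
kept = retained_positions(sequence)
schedule = curriculum([8, 2, 4])  # example ratios, chosen arbitrarily
```

With 10 frames, an interval length of 4, and a ratio of 2, the sketch keeps 5 KV positions for 10 frames, which matches the intended 2x compression. A real model would compute the SST KV states with attention over the interval; this sketch only tracks which positions remain.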

Paper Structure (19 sections, 5 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Training process of our Speech-XL framework. Phase 1: Semantic Alignment, which utilizes ASR tasks to bridge speech features and textual semantics. Phase 2: Modal Alignment, which employs speech captioning to refine the integration of speech context and linguistic descriptions.
  • Figure 2: Training process of our Speech-XL framework. Phase 3: Long-form audio is encoded and partitioned into fixed intervals, where local information is distilled into SST KV states. This mechanism enables Speech-XL to perceive and comprehend exceptionally long speech streams.
  • Figure 3: (a) shows the ranking for the Speech Content Extraction task. (b) shows the ranking for Speaker Information Modeling. (c) presents the overall ranking combining both tasks.
  • Figure 4: Comparison of the memory usage and the forward FLOPs between ours and Qwen2.5-Omni-7B.
  • Figure 5: Performance of various compression methods across different compression ratios. The dashed line represents performance excluding ASR, while the solid line represents the overall performance. For the 16$\times$ ratio, Speech-XL is evaluated through direct inference without fine-tuning.