FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing

Shoutao Guo, Shaolei Zhang, Qingkai Fang, Zhengrui Ma, Min Zhang, Yang Feng

TL;DR

FastLongSpeech addresses the challenge of efficiently processing long-form speech in Large Speech-Language Models (LSLMs) by introducing an on-top extractor that compresses long speech representations through an iterative fusion strategy guided by content density and similarity. A two-stage training regimen (CTC-based content density learning followed by dynamic compression training) enables LSLMs to transfer short-speech reasoning abilities to long-speech tasks without requiring long-speech training data. The authors introduce LongSpeech-Eval, a long-context spoken QA benchmark built from LongBench, to evaluate long-speech understanding. Empirical results show strong performance on both short- and long-speech tasks, along with substantial inference efficiency gains: runtime and compute costs are reduced relative to both baseline LSLMs and cascaded pipelines. Overall, FastLongSpeech advances practical long-speech processing for LSLMs, enabling scalable, efficient reasoning over extended audio inputs.
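To make the idea of fusion guided by content density and similarity concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a simplified greedy variant: at each step the most similar pair of adjacent frames (by cosine similarity) is merged into a density-weighted average, until at most a target number of frames remains. The function names `cosine` and `iterative_fusion`, the one-pair-per-step schedule, and the choice to sum the densities of merged frames are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def iterative_fusion(frames, density, target_len):
    """Greedy sketch: merge the most similar adjacent frames until
    at most `target_len` frames remain.

    frames  : list of feature vectors (lists of floats), one per speech frame
    density : list of non-negative content-density scores, one per frame
    Assumption: a merged frame is the density-weighted mean of its two
    parents, and its density is the sum of theirs.
    """
    if target_len < 1:
        raise ValueError("target_len must be at least 1")
    if len(frames) != len(density):
        raise ValueError("frames and density must have the same length")

    frames = [[float(x) for x in f] for f in frames]
    density = [float(d) for d in density]

    while len(frames) > target_len:
        # Similarity between every pair of adjacent frames.
        sims = [cosine(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)

        w_a, w_b = density[i], density[i + 1]
        total = w_a + w_b
        if total == 0.0:
            # No content in either frame: fall back to a plain average.
            w_a, w_b, norm = 1.0, 1.0, 2.0
        else:
            norm = total

        merged = [(w_a * x + w_b * y) / norm
                  for x, y in zip(frames[i], frames[i + 1])]
        frames[i:i + 2] = [merged]
        density[i:i + 2] = [total]

    return frames, density


# Usage: three frames, the first two nearly redundant, compressed to two.
fused, fused_density = iterative_fusion(
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [0.2, 0.6, 0.9],
    target_len=2,
)
```

In this toy input the first two frames are identical, so they are fused first and the distinct third frame is left untouched. Weighting by density keeps content-heavy frames from being diluted by low-content neighbours.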

Abstract

The rapid advancement of Large Language Models (LLMs) has spurred significant progress in Large Speech-Language Models (LSLMs), enhancing their capabilities in both speech understanding and generation. While existing LSLMs often concentrate on augmenting speech generation or tackling a diverse array of short-speech tasks, the efficient processing of long-form speech remains a critical yet underexplored challenge. This gap is primarily attributed to the scarcity of long-speech training datasets and the high computational costs associated with long sequences. To address these limitations, we introduce FastLongSpeech, a novel framework designed to extend LSLM capabilities for efficient long-speech processing without necessitating dedicated long-speech training data. FastLongSpeech incorporates an iterative fusion strategy that can compress excessively long-speech sequences into manageable lengths. To adapt LSLMs for long-speech inputs, it introduces a dynamic compression training approach, which exposes the model to short-speech sequences at varying compression ratios, thereby transferring the capabilities of LSLMs to long-speech tasks. To assess the long-speech capabilities of LSLMs, we develop a long-speech understanding benchmark called LongSpeech-Eval. Experiments show that our method exhibits strong performance in both long-speech and short-speech tasks, while greatly improving inference efficiency.
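The dynamic compression training idea, exposing the model to short-speech inputs at varying compression ratios, can be illustrated with a small Python sketch. This is a hedged illustration rather than the paper's training code: the candidate target lengths in `CANDIDATE_LENGTHS` are hypothetical values chosen for the example, and `sample_target_length` and `plan_batch` are illustrative names. The sketch only shows how a target length might be drawn per utterance and what compression ratio results; the compression itself and the model update are omitted.

```python
import random

# Hypothetical candidate target lengths (in frames); not taken from the paper.
CANDIDATE_LENGTHS = (750, 400, 200, 100, 50)


def sample_target_length(num_frames, rng, candidates=CANDIDATE_LENGTHS):
    """Draw a random target length, capped at the utterance length so that
    short utterances are never 'compressed' to more frames than they have."""
    return min(rng.choice(candidates), num_frames)


def plan_batch(frame_counts, seed=0):
    """For each utterance, return (num_frames, target_len, compression_ratio).

    compression_ratio = num_frames / target_len, so a smaller target length
    means a higher compression ratio.
    """
    rng = random.Random(seed)
    plan = []
    for n in frame_counts:
        target = sample_target_length(n, rng)
        plan.append((n, target, n / target))
    return plan


# Usage: three short utterances of different lengths.
for num_frames, target_len, ratio in plan_batch([750, 300, 120]):
    print(f"{num_frames} frames -> {target_len} frames (ratio {ratio:.2f})")
```

Because the target length changes from example to example, the model sees the same kind of short-speech data at many compression levels, which is the property the abstract credits for transferring short-speech abilities to long inputs.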

Paper Structure

This paper contains 31 sections, 7 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Architecture of FastLongSpeech. The left panel illustrates that FastLongSpeech generates a response based on the input speech and text instruction. The right panel details the iterative fusion strategy, where numbers between adjacent frames denote similarity scores and numbers below frames represent content density.
  • Figure 2: Performance of different speech fusion methods on short-speech spoken QA tasks. The score is assigned by an LLM that rates each response against the question and the ground-truth answer; a higher score indicates a better response. The baseline model uses a speech window of 750 frames. For all methods other than the baseline, the compression ratio is controlled by adjusting the target length $L$ of the condensed speech representations, so a smaller $L$ corresponds to a higher compression ratio.
  • Figure 3: The prompt template for the LLM to evaluate the response of LSLMs.