AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs

Peize He, Zichen Wen, Yubo Wang, Yuxuan Wang, Xiaoqian Liu, Jiajie Huang, Zehui Lei, Zhuangcheng Gu, Xiangqi Jin, Jiabing Yang, Kai Li, Zhifei Liu, Weijia Li, Cunxiang Wang, Conghui He, Linfeng Zhang

TL;DR

AudioMarathon tackles the challenge of long-context audio understanding by introducing a benchmark that combines minute-scale audio inputs (90.0–300.0 s) encoded as 2,250–7,500 tokens with diverse domains (speech, sound, music) and multi-hop reasoning across extended time windows. It assesses 16 LALMs on 10 tasks, measuring both task performance and inference efficiency, and reveals significant performance gaps as context length grows. The study also systematically analyzes efficiency techniques, including token pruning and KV-cache eviction, demonstrating substantial speedups under careful task-aware configurations while highlighting potential risks to temporal coherence. Overall, AudioMarathon provides a rigorous, practical platform to drive development of memory-efficient, temporally aware audio models that can process long-form content more robustly, with significant implications for real-world applications like meetings, podcasts, and extended dialogues.

Abstract

Processing long-form audio is a major challenge for Large Audio Language Models (LALMs). These models struggle with the quadratic cost of attention ($O(N^2)$ in the sequence length $N$) and with modeling long-range temporal dependencies. Existing audio benchmarks are built mostly from short clips and do not evaluate models in realistic long-context settings. To address this gap, we introduce AudioMarathon, a benchmark designed to evaluate both understanding and inference efficiency on long-form audio. AudioMarathon provides a diverse set of tasks built upon three pillars: (1) long-context audio inputs with durations ranging from 90.0 to 300.0 seconds, which correspond to encoded sequences of 2,250 to 7,500 audio tokens; (2) full domain coverage across speech, sound, and music; and (3) complex reasoning that requires multi-hop inference. We evaluate state-of-the-art LALMs and observe clear performance drops as audio length grows. We also study acceleration techniques and analyze the trade-offs of token pruning and KV cache eviction. The results show large gaps across current LALMs and highlight the need for better temporal reasoning and memory-efficient architectures. We believe AudioMarathon will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks.
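To make the scale in the abstract concrete, the following minimal Python sketch works through the arithmetic. It assumes a constant rate of 25 audio tokens per second, which is inferred here from the stated endpoints (2,250 tokens for 90 s and 7,500 tokens for 300 s) rather than stated by the authors, and it counts query-key pairs as a rough proxy for the $O(N^2)$ attention cost. The function names `audio_tokens` and `attention_pairs` are hypothetical and used only for illustration.

```python
# Assumption: a constant token rate inferred from the abstract's endpoints
# (2,250 tokens / 90 s = 7,500 tokens / 300 s = 25 tokens per second).
TOKENS_PER_SECOND = 25


def audio_tokens(duration_s: float) -> int:
    """Estimate the encoded audio sequence length for a clip of this duration."""
    return round(duration_s * TOKENS_PER_SECOND)


def attention_pairs(num_tokens: int) -> int:
    """Count query-key pairs in full self-attention (the O(N^2) term)."""
    return num_tokens * num_tokens


if __name__ == "__main__":
    for duration in (90.0, 300.0):
        n = audio_tokens(duration)
        print(f"{duration:5.1f} s -> {n:,} tokens, {attention_pairs(n):,} pairs")

    # Relative attention cost of the longest clip versus the shortest one.
    ratio = attention_pairs(audio_tokens(300.0)) / attention_pairs(audio_tokens(90.0))
    print(f"attention cost ratio (300 s vs 90 s): {ratio:.1f}x")
```

Under this assumption, the longest clips are about 3.3 times longer than the shortest ones but incur roughly 11 times as many attention pairs, which is one way to see why the benchmark measures inference efficiency alongside task performance.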

Paper Structure

This paper contains 23 sections, 23 figures, and 8 tables.

Figures (23)

  • Figure 1: Overview of AudioMarathon. AudioMarathon extends short audio clips to long-form audio with a diverse range of task categories, offering a comprehensive and practical assessment of audio intelligence in real-world scenarios.
  • Figure 2: The six-stage data pipeline for constructing AudioMarathon.
  • Figure 3: Per-dataset duration and average length in AudioMarathon.
  • Figure 4: Task composition of AudioMarathon by category.
  • Figure 5: Comparison of the latency-performance trade-off for the Qwen2.5-Omni-3B model under different token pruning strategies across four representative datasets. Frame consistently outperforms other methods across different latency constraints.
  • ...and 18 more figures