Do Modern Video-LLMs Need to Listen? A Benchmark Audit and Scalable Remedy

Geewook Kim, Minjoon Seo

TL;DR

Across 10 benchmarks -- with and without filtering -- audio yields clear gains on tasks requiring speech comprehension or cross-modal grounding, while vision-centric suites remain largely unaffected.

Abstract

Speech and audio encoders developed over years of community effort are routinely excluded from video understanding pipelines -- not because they fail, but because benchmarks never required listening. We audit 10 video benchmarks and find items largely solvable from visual cues alone: a single-frame probe answers ~77% of AVQA without audio, suggesting poor measurement of audio-visual reasoning. Building on LLaVA-OneVision, we attach a speech/audio encoder and compare five compressor architectures under 25x token reduction (25 Hz to 1 Hz). Across 10 benchmarks -- with and without filtering -- audio yields clear gains on tasks requiring speech comprehension or cross-modal grounding, while vision-centric suites remain largely unaffected. Our results show that speech encoders play a larger role in video understanding than current benchmarks suggest. We will fully open-source our work at https://github.com/naver-ai/LLaVA-AV-SSM.

Paper Structure

This paper contains 11 sections, 1 equation, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Three input policies for feeding vision and audio tokens to the LLM. Audio tokens may be compressed via the compressor in Fig. 2.
  • Figure 2: Mamba-based audio compressor with an AVSpeakerBench example. The question (yellow) requires listening; the answer (red) requires watching, so answering demands both modalities. A periodic query every $R$ tokens yields $R{\times}$ reduction ($R{=}25 \approx 1$ token/s).
  • Figure 3: Fraction of items solvable from a single muted frame (GPT-4o, two runs at different temperatures, both correct). Red: ${\geq}$50%; orange: 30--50%; blue: ${<}$30%.
  • Figure 4: (a) Audio token count for a one-hour video; without compression, the encoder produces 90K tokens (e.g., Qwen2.5-Omni). (b) Avg. filtered score (10 benchmarks) vs. compression ratio. UniMambaMia degrades less at $25\times$ ($-$0.6 pp vs. $-$1.8 pp).
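The token counts quoted in the Figure 2 and Figure 4 captions follow from simple arithmetic: a 25 Hz encoder over one hour produces 90K tokens, and a 25x reduction brings that to roughly one token per second. The sketch below only reproduces that arithmetic. The function name `audio_token_count` and the choice to round up partial tokens are illustrative assumptions, not part of the paper's released code.

```python
import math


def audio_token_count(duration_s: float,
                      encoder_rate_hz: float = 25.0,
                      reduction: int = 1) -> int:
    """Estimate how many audio tokens reach the LLM.

    Hypothetical helper: the 25 Hz default and the reduction factor come
    from the captions; rounding up with ceil is an assumption made here.
    """
    # Tokens emitted by the audio encoder before any compression.
    raw_tokens = math.ceil(duration_s * encoder_rate_hz)
    # One output token per `reduction` encoder tokens.
    return math.ceil(raw_tokens / reduction)


one_hour_s = 60 * 60
uncompressed = audio_token_count(one_hour_s)               # 25 Hz, no compression
compressed = audio_token_count(one_hour_s, reduction=25)   # 25x reduction, about 1 Hz

print(uncompressed, compressed)  # 90000 3600
```

With these assumptions, a one-hour clip drops from 90,000 to 3,600 audio tokens, which matches the 25 Hz to 1 Hz reduction stated in the abstract.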