LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs

Keda Tao, Yuhua Zheng, Jia Xu, Wenjie Du, Kele Shao, Hesong Wang, Xueyi Chen, Xin Jin, Junhan Zhu, Bohan Yu, Weiqiang Wang, Jian Liu, Can Qin, Yulun Zhang, Ming-Hsuan Yang, Huan Wang

Abstract

Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. The dataset consists of high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate OmniLLM capabilities including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.

Paper Structure (19 sections, 10 figures, 4 tables)

This paper contains 19 sections, 10 figures, 4 tables.

Figures (10)

  • Figure 1: We introduce LVOmniBench to provide a rigorous assessment of the performance of OmniLLMs on long audio-visual inputs, comprising strictly manually curated videos and annotations. Each question is assigned a specific difficulty level to facilitate hierarchical evaluation of model performance. In the first example, the model must comprehend the entire audio-visual context. Solving this question first requires cross-modal alignment to identify "Toby" and then requires visual counting and scene recognition. Even SoTA models, such as Gemini 3 Pro, struggle to answer this question correctly. The second example presents questions at two additional difficulty tiers, demonstrating that a correct answer requires combining audio and visual information.
  • Figure 2: The construction of LVOmniBench follows a rigorous pipeline encompassing video collection, filtering, and question annotation. To guarantee both high-fidelity data quality and sufficient challenge for OmniLLMs, every component, from raw videos to the final questions, underwent meticulous manual selection and annotation.
  • Figure 3: Distribution of videos. Left: Videos in LVOmniBench span five primary categories encompassing 21 fine-grained subcategories. Each video is selected to ensure sufficient audio-visual information and dynamic variations. Right: The durations of the collected videos range from 10 to 90 minutes, with most between 20 and 50 minutes.
  • Figure 4: Distribution of questions. Left: LVOmniBench comprises nine question subcategories, with each demonstrating a balanced distribution across difficulty levels and video durations. Right: This panel illustrates the distribution of question difficulty and the corresponding audio types required to answer the questions.
  • Figure 5: Comparison between proprietary models and open-source models on LVOmniBench across different tasks. The blue, red, orange, and gray segments in the outer circle denote perception, logical, inference, and understanding tasks, respectively. Proprietary models demonstrate a substantial performance advantage over their open-source counterparts. Furthermore, across all models, significant weaknesses remain in specific sub-tasks, notably Counting and Music Perception.
  • ...and 5 more figures
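
The captions above describe accuracy reported per difficulty level and per task. A minimal sketch of such a grouped-accuracy computation follows, assuming each QA record carries a group label, a reference answer, and a model prediction. The field names (`difficulty`, `answer`, `prediction`), the tier labels, and the letter-style answers are hypothetical and are not taken from the benchmark's actual data format.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy}, grouping records by the value of records[i][key]."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[key]
        total[group] += 1
        # Exact-match scoring; an assumption, the benchmark may score differently.
        if record["prediction"] == record["answer"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Hypothetical records: labels and answers are illustrative only.
records = [
    {"difficulty": "tier_1", "answer": "A", "prediction": "A"},
    {"difficulty": "tier_1", "answer": "B", "prediction": "C"},
    {"difficulty": "tier_2", "answer": "D", "prediction": "D"},
    {"difficulty": "tier_2", "answer": "A", "prediction": "B"},
    {"difficulty": "tier_2", "answer": "C", "prediction": "B"},
]

per_level = accuracy_by_group(records, "difficulty")
for level, acc in sorted(per_level.items()):
    print(f"{level}: {acc:.2%}")
```

The same helper could be reused with a different `key` (for example a task or video-duration label) to produce the other breakdowns the figures mention.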