MAVERIX: Multimodal Audio-Visual Evaluation and Recognition IndeX

Liuyue Xie, Avik Kuthiala, George Z. Wei, Ce Zheng, Ananya Bal, Mosam Dabhi, Liting Wen, Taru Rustagi, Ethan Lai, Sushil Khyalia, Rohan Choudhury, Morteza Ziyadi, Xu Zhang, Hao Yang, László A. Jeni

TL;DR

MAVERIX addresses the lack of standardized audiovisual benchmarks for multimodal LLMs by introducing a comprehensive testbed with 2,556 QA pairs drawn from 700 videos, designed to require tight video–audio integration. The benchmark uses a dual-format evaluation (eight‑option MCQs and open‑ended prompts) across localized and full‑video contexts, accompanied by a rigorous QA and validation pipeline to prevent unimodal shortcuts. Experiments across 17 models show consistent multimodal gains over unimodal baselines but reveal a sizable gap to human performance, especially on longer, socially nuanced tasks. MAVERIX provides a public toolkit and data release to drive advances in cross‑modal reasoning, temporal understanding, and context‑aware perception in multimodal LLMs.

Abstract

We introduce MAVERIX (Multimodal Audio-Visual Evaluation and Recognition IndeX), a unified benchmark for probing video understanding in multimodal LLMs that encompasses video, audio, and text inputs and includes human performance baselines. Although recent models with vision and audio understanding capabilities have shown substantial progress, the field lacks a standardized evaluation framework for thoroughly assessing their cross-modal comprehension. MAVERIX curates 2,556 questions from 700 videos, in both multiple-choice and open-ended formats, explicitly designed to require tight integration of video and audio information and spanning a broad spectrum of agentic scenarios. MAVERIX uniquely provides models with audiovisual questions, closely mimicking the multimodal perceptual experiences available to humans during inference and decision-making. To our knowledge, MAVERIX is the first benchmark aimed explicitly at assessing comprehensive audiovisual integration at this granularity. Experiments with state-of-the-art models, including Qwen 2.5 Omni and Gemini 2.5 Flash-Lite, show accuracy of around 64%, while human experts reach near-ceiling performance of 92.8%, exposing a substantial gap to human-level comprehension. With standardized evaluation protocols, a rigorously annotated pipeline, and a public toolkit, MAVERIX establishes a challenging testbed for advancing audiovisual multimodal intelligence.
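To make the reported numbers concrete, the following is a minimal sketch of how accuracy on eight-option multiple-choice questions and the gap to the human baseline could be computed. It is not the MAVERIX toolkit: the function names, the A-H option labels, and the exact-match scoring rule are assumptions made for illustration only.

```python
from typing import Sequence

# Assumption: options are labelled A-H (eight options per question, per the TL;DR).
OPTION_LABELS = "ABCDEFGH"

# Expected accuracy of uniform random guessing over eight options.
CHANCE_ACCURACY = 1 / len(OPTION_LABELS)


def mcq_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions whose predicted option letter matches the answer key.

    Hypothetical helper: comparison is case-insensitive exact match on the letter.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def gap_to_human(model_accuracy: float, human_accuracy: float) -> float:
    """Difference between human and model accuracy, as a fraction."""
    return human_accuracy - model_accuracy


if __name__ == "__main__":
    # Toy predictions, not benchmark data.
    acc = mcq_accuracy(["A", "b", "H", "C"], ["A", "B", "G", "C"])
    print(f"toy accuracy: {acc:.2f}, chance: {CHANCE_ACCURACY:.3f}")
    # Figures quoted in the abstract: about 64% for models, 92.8% for human experts.
    print(f"gap to human: {gap_to_human(0.64, 0.928):.3f}")
```

With the figures quoted in the abstract, the gap is roughly 29 percentage points, while random guessing over eight options would score 12.5%.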

Paper Structure

This paper contains 26 sections, 20 figures, 10 tables.

Figures (20)

  • Figure 1: An illustration of our proposed benchmark, which includes questions with high audiovisual correlation as well as paraphrased questions and can be used to evaluate models' underlying comprehension abilities and their gap to human performance.
  • Figure 2: Example Agentic Categories and corresponding QAs in the MAVERIX benchmark.
  • Figure 3: The framework to construct annotation sets with hybrid annotator and MLLM-as-judge quality assurance.
  • Figure 4: Impact of Video Length on Agentic Category Performance. (a) Accuracy change (%) from short to full-length videos across models and questions. (b) Accuracy gap (%) to human performance for short and full inputs.
  • Figure 5: Error analysis showing that o1 fails to correctly answer the question when the audio cannot be transcribed into text-based subtitles, leading to an incorrect connection between the cityscape video and the instrumental music.
  • ...and 15 more figures